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1^ ' Abstract 

OO I In this paper, we propose a semi-parametric model for autonomous nonlinear dynamical 

systems and devise an estimation procedure for model fitting. This model incorporates subject- 
specific effects and can be viewed as a nonlinear semi-parametric mixed effects model. We 
["T^l ■ also propose a computationally efhcient model selection procedure. We prove consistency of the 

proposed estimator under suitable regularity conditions. We show by simulation studies that the 
proposed estimation as well as model selection procedures can efficiently handle sparse and noisy 
measurements. Finally, we apply the proposed method to a plant growth data used to study 
growth displacement rates within meristems of maize roots under two different experimental 
conditions. 
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in 

cn ■ 1 Introduction 

' Continuous time dynamical systems arise, among other places, in modeling certain biological pro- 

Q!^ ■ cesses. For example, in plant science, the spatial distribution of growth is an active area of research 

(Basu et al, 2007; Schurr, Walter and Rascher, 2006; van der Weele et al., 2003; Walter et al, 
2002). One particular region of interest is the root apex, which is characterized by cell division, 
^ ' rapid cell expansion and cell differentiation. A single cell can be followed over time, and thus it 

is relatively easy to measure its cell division rate. However, in a meristeni^, there is a changing 
population of dividing cells. Thus the cell division rate, which is defined as the local rate of forma- 
tion of cells, is not directly observable. If one observes root development from an origin attached 
to the apex, tissue elements appear to flow through, giving an analogy between primary growth 
in plant root and fluid flow (Silk, 1994). Thus in Sacks, Silk and Burman (1997), the authors 
propose to estimate the cell division rates by a continuity equation that is based on the principle 
of conservation of mass. Specifically, if we assume a steady growth, then the cell division rate 
is estimated as the gradient (with respect to distance) of cell flux - the rate at which cells are 
moving past a spatial point. Cell flux is the product of cell number density and growth velocity 
field. The former can be found by counting the number of cells per small unit file. The latter is 
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'^meristem is the tissue in plants consisting of undifferentiated cells and found in zones of the plant where growth 
can take place. 
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Figure 1: Left Panel: image of root tip with meristem*: 1 - meristem; 4 - root cap; 5 - elongation 
zone; Right Panel: an illustration of the root tip with the displacements of three markers indicated 
at times to > ^1) ^2) ^3- Cf^o 

m wikipedia) 




the rate of displacement of a particle placed along the root and thus it is a function of distance 
from the root apex. Hereafter we refer to it as the growth displacement rate. Note that, growth 
displacement rate is not to be confused with "growth rate" which usually refers to the derivative of 
the growth trajectory with respect to time. For more details, see Sacks et al. (1997). The growth 
displacement rate is also needed for understanding some important physiological processes such as 
biosynthesis (Silk and Erickson, 1979; Schurr et al, 2006). Moreover, a useful growth descriptor 
called the "relative elemental growth rate" (REGR) can be calculated as the gradient of the growth 
displacement rate (with respect to distance), which shows quantitatively the magnitude of growth 
at each location within the organ. 

There are a lot of research aiming to understand the effect of environmental conditions on the 
growth in plant. For example, root growth is highly sensitive to environmental factors such as 
temperature, water deficit or nutrients ( Schurr et al., 2006; Walter et al., 2002). For example, in 
Sharp, Silk and Hsiao (1988), the authors study the effect of water potential on the root elongation 
in maize primary roots. Root elongation has considerable physiological advantages in drying soil, 
and therefore knowledge of the locations and magnitudes of growth response to water potential 
facilitates the quantitative understanding of the underlying regulatory process. In Sacks et al. 
(1997), an experiment is conducted to study the effect of water stress on cortical cell division rates 
through growth displacement rate within the meristem of the primary root of maize seedlings. In 
this study, for each plant, measurements are taken on the displacement, measured as the distance 
in millimeters from the root cap junction (root apex), of a number of markers on the root over 
a period of 12 hours (Fig. [T) right panel). The plants are divided into two groups - a control 
group under normal water availability; and a treatment group under a water stress. In Fig. [2l the 
growth (displacement) trajectories of one plant with 28 markers in the control group, and another 
plant with 26 markers in the treatment group are depicted. The meristem region of the root, where 
the measurements are taken, is shown in Fig. [1] (left panel). Note that, by definition, the growth 
displacement rate characterizes the relationship between the growth trajectory and its derivative 
(with respect to time). Thus it is simply the gradient function in the corresponding dynamical 
system. (See Section [2] for more details). 

Motivated by this study, in this paper, we focus on modeling and fitting the underlying dynam- 
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Figure 2: Growth trajectories for plant data. Left panel : a plant in control group; Right panel : 
a plant in treatment group 
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ical system based on data measured over time (referred as sample curves or sample paths) for a 
group of subjects. In particular, we are interested in the case where there are multiple replicates 
corresponding to different initial conditions for each subject. Moreover, for a given initial condition, 
instead of observing the whole sample path, measurements are taken only at a sparse set of time 
points together with (possible) measurement noise. In the plant data application, each plant is a 
subject. And the positions of the markers which are located at different distances at time zero from 
the root cap junction correspond to different initial conditions. There are in total 19 plants and 445 
sample curves in this study. The number of replicates (i.e. markers) for each plant varies between 
10 and 31. Moreover, smoothness of the growth trajectories indicates low observational noise levels 
and an absence of extraneous shocks in the system. Hence, in this paper, we model the growth 
trajectories through deterministic differential equations with plant-specific effects. We refer to the 
(common) gradient function of these differential equations as the baseline growth displacement rate. 

We first give a brief overview of the existing literature on fitting smooth dynamical systems in 
continuous time. A large number of physical, chemical or biological processes are modeled through 
systems of parametric differential equations (cf. Ljung and Glad, 1994, Perthame, 2007, Strogatz, 
2001). Ramsay, Hooker, Campbell and Cao (2007) consider modeling a continuously stirred tank 
reactor. Zhu and Wu (2007) adopt a state space approach for estimating the dynamics of cell-virus 
interactions in an AIDS clinical trial. Poyton et al. (2006) use the principal differential analysis 
approach to fit dynamical systems. Recently Chen and Wu (2008a, 2008b) propose to estimate 
differential equations with known functional forms and nonparametric time-dependent coefficients. 
Wu and Ding (1999) and Wu, Ding and DeGruttola (1998) propose using nonlinear least squares 
procedure for fitting differential equations that take into account subject-specific effects. In a recent 
work, Cao, Fussmann and Ramsay (2008) model a nonlinear dynamical system using splines with 
predetermined knots for describing the gradient function. Most of the existing approaches assume 
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known functional forms of the dynamical system; and many of them require data measured on a 
dense grid (e.g., Varah, 1982; Zhu and Wu, 2007). 

For the problems that we are interested in this paper, measurements are taken on a sparse set 
of points for each sample curve. Thus numerical procedures for solving differential equations can 
become unstable if we treat each sample curve separately. Moreover, we are more interested in 
estimating the baseline dynamics than the individual dynamics of each subject. For example, in 
the plant study described above, we are interested in comparing the growth displacement rates 
(as a function of distance from the root cap junction) under two different experimental conditions. 
On the other hand, we are not so interested in the displacement rate corresponding to each plant. 
Another important aspect in modeling data with multiple subjects is that adequate measures need 
to be taken to model possible subject-specific effects, otherwise the estimates of model parameters 
can have inflated variability. Thus in this paper, we incorporate subject-specific effects into the 
model while combining information across different subjects. In addition, because of insufficient 
knowledge of the problem as is the case for the plant growth study, in practice one often has to resort 
to modeling the dynamical system nonparametrically. For example, there is controversy among 
plant scientists about whether there is a growth bump in the middle of the meristem. There are 
also some natural boundary constraints of the growth displacement rate, making it hard to specify 
a simple and interpretable parametric system. (See more discussions in Section [3]). Therefore, 
in this paper, we propose to model the baseline dynamics nonparametrrically through a basis 
representation approach. We use an estimation procedure that combines nonlinear optimization 
techniques with a numerical ODE solver to estimate the unknown parameters. In addition, we derive 
a computationally efficient approximation of the leave-one-curve-out cross validation score for model 
selection. We prove consistency of the proposed estimators under appropriate regularity conditions. 
Our asymptotic scenario involves keeping the number of subjects fixed and allowing the number of 
measurements per subject to grow to infinity. The analysis differs from the usual nonparametric 
regression problems due to the structures imposed by the differential equations model. We show 
by simulation studies that the proposed approach can efficiently estimate the baseline dynamics 
under the setting of multiple replicates per subject with sparse noisy measurements. Moreover, 
the proposed model selection procedure is effective in maintaining a balance between fidelity to 
the data and to the underlying model. Finally, we apply the proposed method to the plant data 
described earlier and compare the estimated growth displacement rates under the two experimental 
conditions. 

The rest of paper is organized as follows. In Section [21 we describe the proposed model. In 
Sections [3] and HI we discuss the model fitting and model selection procedures, respectively. In 
Section \5\ we prove consistency of the proposed estimator. In Section [6l we conduct simulation 
studies to illustrate finite sample performance of the proposed method. Section [7| is the application 
of this method to the plant data. Technical details are in the appendices. An R package dynamics 
for fitting the model described in this paper is available upon request. 

2 Model 

In this section, we describe a class of autonomous dynamical systems that is suitable for modeling 
the problems exemplified by the plant data (Section [T|). An autonomous dynamical system has the 
following general form: 

X'{t)=f{X{t)), te[To,Ti]. 
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Figure 3: Empirical derivatives (divided differences) X'{t) against empirical fits (averaged mea- 
surements) X(t) for treatment group. 




Without loss of generality, henceforth Tq = and Ti = 1. Note that, the above equation means 
that X{t) = a + Jq f{X{u))du, where a = X{0) is the initial condition. Thus in an autonomous 
system, the dynamics (which is characterized by /) depends on time t only through X{t). This 
type of systems arises in various scientific studies such as modelling prey-predator dynamics, virus 
dynamics, or epidemiology (cf. Perthame, 2007). Many studies in plant science such as Silk (1994), 
Sacks et al. (1997), Fraser, Silk and Rost (1990) all suggest reasonably steady growth velocity 
across the meristem under both normal and water-stress conditions at an early developmental 
stage. Moreover, exploratory regression analysis based on empirical derivatives and empirical fits 
of the growth trajectories indicates that time is not a significant predictor and thus an autonomous 
model is reasonable. This assumption is equivalent to the assertion that the growth displacement 
rate depends only on the distance from the root cap junction. It means that time zero does not 
play a role in terms of estimating the dynamical system and there is also no additional variation 
associated with individual markers. 

Figure [3] shows the scatter plot of empirical derivatives versus empirical fits in the treatment 
group. It indicates that there is an increase in the growth displacement rate starting from a zero 
rate at the root cap junction, then followed by a nearly constant rate beyond a certain location. 
This means that growth stops beyond this point and the observed displacements are due to growth 
in the part of the meristem closer to the root cap junction. Where and how growth stops is of 
great scientific interest. The scatter plot also indicates excess variability towards the end which is 
probably caused by plant-specific scaling effects. 

Some of the features described above motivate us to consider the following class of autonomous 
dynamical systems: 

Xliit) = g,{Xa{t)), l = lr-- ,Nf,i = l,...,n, (1) 
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where {Xii{t) : t G [0, 1], / = 1, • • • , Ni; i = 1, . . . , n} is a collection of smooth curves corresponding 
to n subjects, and there are Ni curves associated with the i-th subject. For example, in the plant 
study, each plant is a subject and each marker corresponds to one growth curve. We assume that, 
all the curves associated with the same subject follow the same dynamics, and these are described 
by the functions {fl'i(-)}r=i- We also assume that only a snapshot of each curve Xii{-) is observed. 
That is, the observations are given by 

Yiij = Xii{tiij) + Eiij, j = l,...,mii, (2) 

where < tm < • • • < tumn < 1 are the observation times for the curve of the i*'* subject, and 
{euj} are independently and identically distributed noise with mean zero and variance cr^ > 0. In 
this paper, we model {^^(OliLi 

gi{-) = e''g{-), i = l,...,n, (3) 

where 

(1) the function g{-) reflects the common underlying mechanism regulating all these dynamical 
systems. It is assumed to be a smooth function and is referred as the gradient function. For 
the plant study, it represents the baseline growth displacement rate for all plants within a 
given group (i.e., control vs. water-stress). 

(2) 6'-s reflect subject-specific effects in these systems. The mean of Oi's is assumed to be zero 
to impose identifiability. In the plant study, O'^s represent plant-specific scaling effects in the 
growth displacement rates for individual plants. 

The simplicity and generality of this model make it appealing for modeling a wide class of 
dynamical systems. First, the gradient function g{-) can be an arbitrary smooth function. If g is 
nonnegative, and the initial conditions Xj;(0)'s are also nonnegative, then the sample trajectories 
are increasing functions, which encompasses growth models that are autonomous. Secondly, the 
scale parameter e^' provides a subject-specific tuning of the dynamics, which is flexible in capturing 
variations of the dynamics in a population. In this paper, our primary goal is to estimate the 
gradient function g nonparametrically. For the plant data, the form of g is not known to the 
biologists, only its behavior at root cap junction and at some later stage of growth are known 
(Silk, 1994). The fact that the growth displacement rate increases from zero at root cap junction 
before becoming a constant at a certain (unknown) distance away from the root tip implies that 
a linear ODE model is apparently not appropriate. Moreover, popular parametric models such as 
the Michaelis-Menten type either do not satisfy the boundary constraints, and/or have parameters 
without clear interpretations in the current context. On the other hand, nonparametric modeling 
provides flexibility and is able to capture features of the dynamical system which are not known to 
us a priori (Section [7|). In addition, the nonparametric fit can be used for diagnostics for lack of 
fit, if realistic parametric models can be proposed. 

The gradient function g being smooth means that it can be well approximated by a basis 
representation approach: 

M 

9{x) ='^f3k4'k,M{x) (4) 
fe=i 

where (pi^ui'), ■ ■ ■ , 4'm,m{-) are linearly independent basis functions, chosen so that their combined 
support covers the range of the observed trajectories. For example, we can use cubic splines with a 
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suitable set of knots. Thus, for a given choice of the basis functions, the unknown parameters in the 
model are the basis coefficients /3 := . . . , Pm)'^ , the scale parameters 6 := {6i}^^^, and possibly 
the initial conditions a := {an := Xii{0) : / = 1, • • • , Ni}^^^. Also, various model parameters, such 
as the number of basis functions M and the knot sequence, need to be selected based on the data. 
Therefore, in essence, this is a nonlinear, semi-parametric, mixed effects model. 

In the plant data, g is nonnegative and thus a modeling scheme imposing this constraint may 
be more advantageous. However, the markers are all placed at a certain distance from the root 
cap junction, where the growth displacement rate is already positive, and the total number of 
measurements per plant is moderately large. These mean that explicitly imposing nonnegativity 
is not crucial for the plant data. Indeed, with the imposition of the boundary constraints, the 
estimate of g turns out to be nonnegative over the entire domain of the measurements (Section [7]). 
In general, if g is strictly positive over the domain of interest, then we can model the logarithm of 
g by basis representation. Also, in this case, the dynamical system is stable in the sense that there 
is no bifurcation phenomenon (Strogatz, 2001). 

3 Model Fitting 

In this section, we propose an iterative estimation procedure that imposes regularization on the 
estimate of 6 and possibly a. One way to achieve this is to treat them as unknown random pa- 
rameters from some parametric distributions. Specifically, we use the following set of working 
assumptions: (i) a^s are independent and identically distributed as N{a,aa) and 6'j's are indepen- 
dent and identically distributed as A^(0, ag), for some a S M and a"^ > 0, cj| > 0; (ii) the noise euj^s 
are independent and identically distributed as N^Oja"^) for o"^ > 0; (iii) the three random vectors 
a, 6, e := {euj} are independent. Under these assumptions, the negative joint log-likelihood of the 
observed data Y := {Ynj}, the scale parameters 6 and the initial conditions a is, up to an additive 
constant and a positive scale constant, 

n Ni rriii n Ni n 

i=i 1=1 j=i 1=1 1=1 i=i 

where Ai = a'^/a'^, A2 = cr'^/ag, and Xii{-) is the trajectory determined by an, 6i, and f3. This can 
be viewed as a hierarchical maximum likelihood approach (Lee, Nelder and Pawitan, 2006), which 
is considered to be a convenient alternative to the full (restricted) maximum likelihood approach. 
Define 

N, 

liij{aii,9i,P) := [Yiij - Xii{tiij;au,ei, (3)f + Xi{au - af/mu + \2Qll ^^mn . 

1=1 

Then the loss function in ([5]) equals Yl^=i J2iti ^^=i^iij{(^iu(^ij P)- Note that the above distribu- 
tional assumptions are simply working assumptions. The expression in ([5]) can also be viewed as 
a regularized £2 loss with penalties on the variability of 9 and a. For the plant data, the initial 
conditions (markers) are chosen according to some fixed experimental design, thus it is natural to 
treat them as fixed effects. Moreover, it does not seem appropriate to shrink the estimates toward 
some common value in this case. Thus in Section [71 we set Ai = when estimating a. For certain 
other problems, treating the initial conditions as random effects may be more suitable. For exam- 
ple, Huang, Liu and Wu (2006) study a problem of HIV dynamics where the initial conditions are 
subject-specific and unobserved. 
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In many situations, there are boundary constraints on the gradient function g. For example, 
according to plant science, both the growth displacement rate and its derivative at the root cap 
junction should be zero. Moreover, it should become a constant at a certain (unknown) distance 
from the root cap junction. Thus for the plant data, it is reasonable to assume that, ^(0) = = ^'(0) 
and g'{x) = for x > A for a given ^ > 0. The former can be implemented by an appropriate 
choice of the basis functions. For the latter, we consider constraints of the form: 0^'Qf3 for an 
M X M positive semi-definite matrix B, which can be thought of as an ^2-type constraint on some 
derivative of g. (See Section [7] for the specification of B). Consequently, the modified objective 
function becomes 

n Ni mu 

L{a,e,(3) :=^^^£,;,(a,z,0i,/3)+/3^B/3. (6) 

i=i 1=1 j=i 

The proposed estimator is then the minimizer of the objective function: 

(a,0,3) := arg min L(a,0,/3). (7) 

Note that, here our main interest is the gradient function g. Thus estimating the parameters of 
the dynamical system together with the sample trajectories and their derivatives simultaneously 
is most efficient. In contrast, if the trajectories and their derivatives are first obtained via pre- 
smoothing (as is done for example in Chen and Wu (2008a, 2008b), Varah (1982)), and then used 
in a nonparametric regression framework to obtain g, it will be inefficient in estimating g. This 
is because, errors introduced in the pre-smoothing step cause loss of information which is not 
retrievable later on, and also information regarding g is not efficiently combined across curves. 
In the following, we propose a numerical procedure for solving ([7]) that has two main ingredients: 

• Given (a, 0,/3), reconstruct the trajectories {Xii{-) ■.l = l,--- ,-/Vj}"^-^ and their derivatives. 
This step can be carried out using a numerical ODE solver, such as the 4*^* order Runge-Kutta 
method (cf. Tenenbaum and Pollard, 1985). 

• Minimize ([6|) with respect to {a, 0,(3). This amounts to a nonlinear least squares problem 
(Bates and Watts, 1988). It can be carried out using either a nonlinear least squares solver, 
like the Levenberg-Marquardt method; or a general optimization procedure, such as the 
Newton-Raphson algorithm. 

The above procedure bears some similarity to the local, or gradient-based, methods discussed in 
Miao et al. (2008). 

We now briefly describe an optimization procedure based on the Levenberg-Marquardt method 
(cf. Nocedal and Wright, 2006). For notational convenience, denote the current estimates by 
a* := {a*J, 0* := {9*} and and define the current residuals as: euj = Ynj — Xii{tiij; a*i, 61,13*). 
For each i = 1 , • • • , n and / = 1 , • • • ,Ni, define the mu x 1 column vectors 

Ji/,a* := y-Q^^^ii{Uij]a*i,9*,f3*)j , en = {eiij)'^^\. 

For each i = 1, • • • , n, define the mj. x 1 column vectors 

/ Q ^ \ m« ,Ni 
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where rrii. := 'Yld=i''^ii is the total number of measurements of the i*^ cluster. Finally, for each 
= 1, • • • , M, define the m.. x 1 column vectors: 



/ Q _ ■\mii,Ni,n 



mii,Ni,n 

rrHi,Ni,n 

J=l.i = l,i=l 5 

j = l,l = l,i = l 



where m.. := XlILi Sz^i total number of measurements. Note that, given a*,6* and 

P*, the trajectories {Xii}'s and their gradients (as well as Hessians) can be easily evaluated on a 
fine grid by using numerical ODE solvers such as the 4*'' order Runge-Kutta method as mentioned 
above (see Appendix A). 

We break the updating step into three parts corresponding to the three different sets of pa- 
rameters. For each set of parameters, we first derive a first order Taylor expansion of the curves 
{Xii} around the current values of these parameters and then update them by a least squares 
fitting, while keeping the other two sets of parameters fixed at the current values. The equation 
for updating /3, while keeping a* and 6* fixed, is 

V + ^3 diag(jj* J^*) + b] (/3 - /3*) = Jj.i - B/3*, 

where J^* := ( Jg* : • • • : J^* ) is an m.. x M matrix. Here A3 is a sequence of positive constants 
converging to zero as the number of iterations increases. They are used to avoid possible singularities 
in the system of equations. The normal equation for updating 6i is 

(J^g. J,,,. + - 0*) = ile*e, - X29;, i = 1, . . . , n. (8) 

The equation for updating au is derived similarly, while keeping 6i and P fixed at 6^, (3*: 

iJa,a*Jii,a*+>'i){aii-a*i)=3ji*eii + Xia*i, l = l,---,Ni, i = l,---,n, (9) 

where a* = Z^/l'i ^u/^-i '^u = o.* — a*i with N. := Yll^=i being the total number of sample 
curves. 

In summary, this procedure begins by taking initial estimates and then iterates by cycling 
through the updating steps for /3, and a until convergence. The initial estimates can be conve- 
niently chosen. For example, a^™ = Ym, 6f^^ = 0; or a*™ = Z^^Li S/l'i ^iii- Even though the 
model is identifiable, in practice, for small n, there can be drift in the estimates of 9i and g due 
to flatness of the objective function in some regions. To avoid this and increase stability, we also 
impose the condition that Yl^=i — 0- This can be easily achieved by subtracting 9* := ^ Yl7=i 
from 0* at each iteration after updating {9i}. 

All three updating steps described above are based on the general principle of Levenberg- 
Marquardt algorithm by the linearization of the curves {Xn} (see Appendix B). However, the 
tuning parameter A3 plays a different role than the penalty parameters Ai and A2. The parameter 
A3 is used to stabilize the updates of /3 and thereby facilitate convergence. Thus it needs to decrease 
to zero with increasing iterations in order to avoid introducing bias in the estimate. There are ways 
of implementing this adaptively (see e.g. Nocedal and Wright, 2006, Ch. 10). In this paper, we 
use a simple non-adaptive method: A3j = X^/j for the j-th iteration, for some pre-specified A3 > 0. 
On the other hand, Ai and A2 are parts of the penalized loss function ([6]). Their main role is to 
control the bias-variance trade-off of the estimators, even though they also help in regularizing the 
optimization procedure. From the likelihood view point, Ai, A2 are determined by the variances o"^. 



9 



cr^ and a^. After each loop over all the parameter updates, we can estimate these variances from 
the current residuals and current values of a and 9. By assuming that mu > 2 for each pair (i, I), 



n Ni mu 



m.. — N. - n — M 

i=i 1=1 j=i 



cr. 



n Ni 

1=1 1=1 i=l 

We can then plug in the estimates a^, ct^ and to get new values of Ai and A2 for the next 
iteration. On the other hand, if we take the penalized loss function view point, we can simply 
treat Ai, A2 as fixed regularization parameters, and then use a model selection approach to select 
their values based on data. In the following sections, we refer the method as adaptive if Ai, A2 
are updated after each iteration; and refer the method as non- adaptive if they are kept fixed 
throughout the optimization. 

The Levenberg-Marquardt method is quite stable and robust to the initial estimates. However, 
it converges slowly in the neighborhood of the minima of the objective function. On the other hand, 
the Newton-Raphson algorithm has a very fast convergence rate when starting from estimates that 
are already near the minima. Thus, in practice the we first use the Levenberg-Marquardt approach 
to obtain a reasonable estimate, and then use the Newton-Raphson algorithm to expedite the search 
of the minima. The implementation of the Newton-Raphson algorithm of the current problem is 
standard and is outlined in Appendix C. 



4 Model Selection 

After specifying a scheme for the basis functions {<pk,Mi')}j we still need to determine various 
model parameters such as the number of basis functions M, the knot sequence, etc. In the liter- 
ature AIC/BIC/AICc criteria have been proposed for model selection while estimating dynamical 
systems with nonparametric time-dependent components (e.g. Miao et al., 2008). Here, we propose 
an approximate leave-one-curve-out cross-validation score for model selection. Under the current 
context, the leave-one-curve-out CV score is defined as 

n Ni mu 

cy - EEE^^S(ar\r^3^"'^) (10) 

i=l 1=1 j=l 

where 9^ and /3 are estimates of 9i and /3, respectively, based on the data after dropping the 
j^th Q^j-yg ii^Q jth cluster; and a"-^ is the minimizer of ^1^=1 ^Uji^ih'^ ^^\(3^ ^) with respect to 
an. The function (.^i- is a suitable criterion function for cross validation. Here, we use the prediction 
error loss: 

^iijiO'ih^i^ ■= (Yiij — Xii{tiij;aii,6i, P)j . 

Calculating CV score (jlOp is computationally very demanding. Therefore, we propose to approx- 
imate 9^ and {3 ^ by a first order Taylor expansion around the estimates 9i , /3 based on the 
full data. Consequently we derive an approximate CV score which is computationally inexpensive. 
A similar approach is taken in Peng and Paul (2009) under the context of functional principal 
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component analysis. Observe that, when evaluated at the estimate a, 6 and /3 based on the full 
data, 



_d_ 



5^d+2A2^* = 0, i = l,---,n; 



d_ 
d(3 



J^C, +2B/3 = 0. (11) 



Whereas, when evaluated at the drop (i, /)-estimates: ^ *'\/3^ \ 



Expanding the left hand side of (112|) around /3, we obtain 



+2B/3 = 0. 



(12) 



. 5: 

i*,«*j:(i*,/*)7^(i,Z) 



9^3 



+ 2B/3 + 



E 



92 



dp 



+ 



E 



i*,«*,i:(i*,/*)^(i,0 



9/39/3 



9/39/3 



/3 



+ 2B 



+ 2B 



where in the second step we invoked pT]) and approximated {a^^^ *'^}){^ *'^} by {an}, {^j}, re- 
spectively. Similar calculations are carried out for 6^ . Thus we obtain the following first order 
approximations : 



+ 



f)Zfcv 



+ 2A2 



l'=lj'=l 



E 



90, 



(13) 



These gradients and Hessians are all evaluated at (a,6,P), and thus they have already been com- 
puted (on a fine grid) in the course of obtaining these estimates. Thus, there is almost no additional 
computational cost to obtain these approximations. Now for i = 1, . . . ,n;l = 1, ■ ■ ■ ,Ni, define 



a-i = argmm 



imY,iYuj-Xuituj;a,9i^ "'\p^ '^)' + AiC 



a — a) 



(14) 



where a is the estimator of a obtained from the full data. Finally, the approximate leave-one- 
curve-out cross-validation score is 



n Ni ran 

^-EEE^^(«ir^r'\3^"'^^ 
,=1 1=1 j=i 



(15) 
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5 Asymptotic Theory 



In this section, we present a result on the consistency of the proposed estimator of g under suitable 
technical conditions. We assume that the number of subjects n is fixed; and the number of mea- 
surements per curve mn, and number of curves Ni per subject, increase to infinity together. When 
n is fixed, the asymptotic analysis is similar irrespective of whether 0j's are viewed as fixed effects 
or random effects. Hence, for simplicity, we treat ^j's as fixed effects and impose the identifiability 
constraint 6i = 0. Due to this restriction, we modify the loss function ([SD slightly by replacing the 
penalty A2 Y17=i '^ith -^2 Z^"=2(^« ~ where 9 = ^^"=2 ~ !)• Moreover, since n is finite, in 
practice we can relabel the subjects so that the curves corresponding to subject 1 has the highest 
rate of growth, and hence 9i < for all i > 1. This relabeling is not necessary but simplifies the 
arguments considerably. 

Moreover, to be consistent with the setting of the plant data, we focus on the case where the 
time points for the different curves corresponding to the same subject are the same, so that, in 
particular, mu = mj. We assume that the time points come from a common continuous distribution 
Ft- We also assume that the gradient function g{x) is positive for x > and is defined on a domain 
D = [xo,xi] C M^; and the initial conditions {au := Xii{0)}'s are observed (and hence Ai = 0) 
and are randomly chosen from a common continuous distribution Fa with support [xo,a;2] where 
X2 < Xi. 

Before we state the regularity conditions required for proving the consistency result, we highlight 
two aspects of the asymptotic analysis. Note that, the current problem differs from standard 
semiparametric nonlinear mixed effects models. First, the estimation of g is an inverse problem, 
since it implicitly requires knowledge of the derivatives of the trajectories of the ODE which are not 
directly observed. The degree of ill-posedness is quantified by studying the behavior of the expected 
Jacobian matrix of the sample trajectory with respect to f3. This matrix would be well-conditioned 
under a standard nonparametric function estimation context. However, in the current case, its 
condition number goes to infinity with the dimension of the model space M. Secondly, unlike in 
standard nonparametric function estimation problems where the effect of the estimation error is 
localized, the estimation error propagates throughout the entire domain of g through the dynamical 
system. Therefore, sufficient knowledge of the behavior of g at the boundaries is imperative. 

We assume the following: 

Al g £ C^[D) for some integer p > 4, where D = [xq,xi\ C M^. 
A2 6iS are fixed parameters with Oi = 0. 

A3 The collection of basis functions <I>a/ := {</>!, Af '/*m,a/} satisfies: (i) 4>k,M £ C^{D) for 
aU k; (fi) sup.g^Efcli l'^M/(^)l' = ©(M^+^i), for j = 0,1,2; (ifi) for every k, the length 
of the support of 0fc,A/ is 0{M~^); (iv) for every M, there is a /3* G such that \\ g — 
Ek=iP*k<t>k,M \\loo^d)= 0{M-^Py, II 9' - EkLiPl^PlM IU-(D)= 0{M~'), for some c > 0; 
'^^=iPk^kM i^ Lipschitz with Lipschitz constant 0(M); and || '^k=if^k^k M \\l^{d)= 0{1). 

A4 Xj;(0)'s are i.i.d. from a continuous distribution Fa- Denote supp(-Fa) = [xo,3;2] and let be 
a fixed, open interval containing the true Oi's, denoted by 6^. Then there exists a r > such 
that for all a £ Fa and for all 9 £ @, the initial value problem 

x'{t) = e^f{x{t)), x(0) = a (16) 
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has a solution x{t) := x(t; a, 6, f) on [0, 1] for all / G M.{g,T), where 

M{g,T) :={fGCHD) :|| / - 5 l|i,D< r}. 

Moreover, the range of •, •, /) (as a mapping from [0, 1] x supp(Fa) x 0) is contained in 
D ± €(r) for some e(r) > (with limT_>oe(r) = 0) for all / G M.{g,T). Here, || • \\i^d is 
the seminorm defined by || / ||i,d=|| / + II /' IIl°°(D)- Furthermore, the range of 

x{-;-,0,g) contains D. 

A5 For each i = 1, . . . , n, for all / = 1, . . . , Ni, the time points tnj (j = 1, . . . , m^) belong the set 
{Tjjv : 1 < j' < rui}. And {Tjj/} are i.i.d. from the continuous distribution Ft supported 
on [0, 1] with a density fx satisfying ci < /t < C2 for some < ci < C2 < oo. Moreover, 
m := Y17=i ''^i / ''^ oo as N := XlILi-^*/^ Also, both Ni^s and jn^.'s increase to 

infinity uniformly meaning that maxj Ni/ min^ iVj and maxj mj/miuj remain bounded. 

A6 Define Xii{-; Xii{0),9i, P) to be the solution of the initial value problem 

M 

x'it) = e'^' ^/3fc0fc,M(x(t)), t £ [0, 1], x(0) = XaiO). (17) 

k=l 

Let X^/{-;0i,f3) and X^{-;9i,P) be its partial derivatives with respect to parameters 9i and 
13. And let /3* G be as in A3. Define Gl gg := Kg* ^f3*{X.l{Tir,e* , (3*))^ , Gi ^g := 

Efl.,^. (xf,' (r,,i;e*,/3*)X,^(ri,i;0*,/3*)) andC:^^^ := E^.^^g* (x,^(r,,i; 0*, ^ )(X,'^(T,,i; 0*, /3*))^ 
where Eg*^^* denotes the expectation over the joint distribution of (Xji(O), Tj^i) evaluated 



at Oi = 6* and (3 = 13*. Define G^fio = diag{Gi gg)2^2^ 



G2 . . /~ir/ 

^ flfl ■ • • • • 



, and 



G^^/j/j = J2i=i pfs- Then, there exists a function and a constant C3 G (0, 00), such that, 

II {G,,i3f3)-^ \\< KM and II {G,,eey^ \\< cg. (18) 

A7 The noise ej/j's are i.i.d. A^(0,cr^) with fj^ bounded above. 

Before stating the main result, we give a brief explanation of these assumptions. Al ensures 
enough smoothness of the solution paths of the differential equation p6p . It also ensures that the 
approximation error, when g is approximated in the basis $M) is of an appropriate order. Condition 
A3 is satisfied when we approximate g using the {p — l)-th order B-splines with equally spaced 
knots on the interval D which are normalized so that (j)k^M{x)'^dx = 1 for all k. Note that (7^* can 
be viewed as an optimal approximation of g in the space generated by $jv/- Condition A4 ensures 
that a solution of (|16p exist for all / of the form g^ with (3 sufficiently close to fi* . This implies 
that we can apply the perturbation theory of differential equations to bound the fluctuations of 
the sample paths due to a perturbation of the parameters. Condition A5 ensures that the time- 
points {Tij} cover the domain D randomly and densely, and that there is a minimum amount of 
information per sample curve in the data. Condition A6 is about the estimability of a parameter 
(in this case g) in a semiparametric problem in the presence of nuisance parameters (in this case 
{Oi}). Indeed, the matrix G*,/?/? — G*^/36»(G^=^ee))~-^G*^e/3 plays the role of the information matrix for f3 
at (9*, (3*). Equation (llSp essentially quantifies the degree of ill-conditionedness of the information 
matrix for (3. Note that A4 together with A6 implicitly imposes a restriction on the magnitude 
of II 9' \\Loa{D)- Condition A6 has further implications. Unlike in parametric problems, where the 
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information matrix is typically well-conditioned, we have hm ^ oo in our setting (see Theorem 
2 and Proposition 1 below). Note that in situations when 5 > and the initial conditions are 
nonnegative, one can simplify A6 considerably, since then we can obtain explicit formulas for the 
derivatives of the sample paths (see Appendix A). And then one can easily verify the second part 
of equation (fTBj) . 

Theorem 1: Assume that the data follow the model described by equations ([2]j, (0j and ^ with 
9i = 0. Suppose that the true gradient function g, the distributions Fa and Ft, and the collection of 
basis functions satisfy A1-A7 . Suppose further that g is strictly positive over D = [xq,xi]. Sup- 
pose that {Xii{0)} are known (so that Xi = 0), X2 = o{aj\;Nmt<^j) and the sequence M = M{N,rn) 

is such that min{iV,m} > ka/M log(iVm), ka/M-(p-i) 0, and iaax{KMM^/'^ , k\{^ M^^^} 
as N,rn 00, where > C iiaax{asK^/.j^ M^^'^ {Nm)~^^'^ , k^j^^^ M~p} for some sufficiently large 
constant C > 0. Then there exists a minimizer (0,(3) of the objective function such that if 
g := 'Y^^=i(3k(t>k,M J then the following holds with probability tending to 1: 

~ n 

/ \g{x)-g{x)\^dx<a% + 0{M-^P), y2\0^ - 6*1^ < a%. (19) 

As explained earlier, km is related to the inverse of the smallest eigenvalue of the matrix 

G*,i3i3 G*,i3e 

In order to show that our method leads to a consistent estimator of g, we need to know the 
behavior of k,m as M — > 00. The following result quantifies the behavior when we choose a B-spline 
basis with equally spaced knots inside the domain D. 

Theorem 2: Suppose that supp{Fa) = [3:0, 3^2] C M"^ and g is strictly positive over the domain 
D = [xo,xi]. Suppose also that the (normalized) B-splines of order > 2 are used as basis functions 
{4'k,M} where the knots are equally spaced on the interval [xq + 5,xi — S], for some small constant 
6 >'o. Then km = 0{M'^). 

The condition that the knots are in the interior of the domain D is justified if the function 
g is completely known on the set [xo,xo + (5] U [xi — 5,xi\. Then this information can be used 
to modulate the B-splines near the boundaries so that all the properties listed in A3 still hold 
and we have the appropriate order of the approximations. We conjecture that the same result 
{km = 0{M'^)) still holds even \i g is known only up to a parametric form near the boundaries, and 
a combination of the parametric form and B-splines with equally spaced knots is used to represent 
it. If instead the distribution Fa is such that near the end points (xq and X2) of the support of Fa, 
the density behaves like (x — xo)~^'^'^ and (x2 — x)"^"*"^, for some 7 S (0, 1], then it can be shown 
that (Proposition 1) km = 0(M^"*"^'^). Thus, in the worst case scenario, we can only guarantee 
that Km = O(M^). In that case g needs to have a higher order of smoothness {g € C^^''{D), for 
some e > 1/2), and higher-order (at least seventh order) B-splines are needed to ensure consistency. 

It can be shown that under mild conditions km should be at least O(M^). Thus, the condition 
aAT max{KA/M^/^, K^y^^^M^/^} = o(l) can be simplified to kmoinM^/"^ = o(l). When km x M^, 
Theorem 1 holds with p = 4, so that g G and cubic B-splines can be used. Moreover, under 
that setting as long as m/N is bounded both above and below and cJe is bounded below, then 
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min{A^,m} ^ k.]\jM log{Nm). The following proposition states the dependence of km on the 
behavior of the density of the distribution Fa- 

Proposition 1: Assume that the density of Fa behaves like {x — xq)"^'^'^ and {x2 — x)~^~^'^ , near 
the endpoints xq and X2, for some 7 G (0, 1], and is bounded away from zero in the interior. Then 

KM = 0{M^+^^). 

The proof of Theorem 1 involves a second order Taylor expansion of loss function around the 
optimal parameter {6*,P*). We apply results on perturbation of differential equations (cf. Deufl- 
hard and Bornemann, 2002, Ch. 3) to bound the bias terms \Xii{tiij]aii,9i,f) — Xii{tiij; aii,6i, g)\ 
for arbitrary 9i and functions /, g. The same approach also allows us to provide bounds for various 
terms involving partial derivatives of the sample paths with respect to the parameters in the afore- 
mentioned Taylor expansion. Proof of Theorem 2 involves an inequality (Halerpin-Pitt inequality) 
on bounding the square integral of a function by the square integrals of its derivatives (Mitrinovic, 
Pecaric and Fink, 1991, p. 8). The detailed proofs are given in Appendix E. 

6 Simulation 

In this section, we conduct a simulation study to demonstrate the effectiveness of the proposed 
estimation and model selection procedures. In the simulation, the true gradient function g is rep- 
resented by A'/* = 4 cubic B-spline basis functions with knots at (0.35,0.6,0.85,1.1) and basis 
coefficients /3 = (0.1,1.2,1.6,0.4)-^. It is depicted by the solid curve in Figured! We consider 
two different settings for the number of measurements per curve: moderate case ~ rn^s are inde- 
pendently and identically distributed as Uniform[5, 20]; sparse case - mn^s are independently and 
identically distributed as Uniform[3, 8]. Measurement times {tuj} are independently and identically 
distributed as Uniform[0, 1]. The scale parameters 0j's are randomly sampled from A^(0, (t|) with 
ag = 0.1; and the initial conditions a^s are randomly sampled from a Caxi distribution, with 
Ca, ka > chosen such that a = 0.25, 0"^ = 0.05. Finally, the residuals euj^s are randomly sampled 
from N{0,a'^) with Ue = 0.01. Throughout the simulation, we set the number of subjects n = 10 
and the number of curves per subject Ni = N = 20. Observations {Ynj} are generated using the 
model specified by equations ([I]) - (jl]) in Section [2l For all the settings, 50 independent data sets 
are used to evaluate the performance of the proposed procedure. 

In the estimation procedure, we consider cubic B-spline basis functions with knots at points 
0.1 + (1 : M)/M to model g, where M varies from 2 to 6. The Levenb erg-Mar qardt step is chosen 
to be non-adaptive, and the Newton-Raphson step is chosen to be adaptive (see Section [3] for 
the definition of adaptive and non-adaptive). We examine three different sets of initial values 
for Ai and A2: (i) Ai = a'^/a^ = 0.04, A2 = cr^/o"^ = 0.01 ("true" values); (ii) Ai = 0.01, A2 = 
0.0025 ("deflated" values); (iii) Ai = 0.16, A2 = 0.04 ("inflated" values). It turns out that the 
estimation and model selection procedures are quite robust to the initial choice of (Ai, A2), thereby 
demonstrating the effectiveness of the adaptive method used in the Newton-Raphson step. Thus 
in the following, we only report the results when the "true" values are used. 

We also compare results when (i) the initial conditions a are known, and hence not estimated; 
and (ii) when a are estimated. As can be seen from Table (H the estimation procedure converges 
well and the true model (M* = 4) is selected most of the times for all the cases. Mean integrated 
squared error (MISE) and Mean squared prediction error (MSPE) and the corresponding standard 
deviations, SD(ISE) and SD(SPE), based on 50 independent data sets, are used for measuring the 
estimation accuracy of 'g and 6, respectively. Since the true model is selected most of the times, we 
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Table 1: Convergence and model selection based on 50 independent replicates. 









a 


known 






a 


estimated 






Model 


2 


3 


4 


5 


6 


2 


3 


4 


5 


6 


moderate 


Number converged 


50 


50 


50 


50 


50 


50 


7 


50 


50 


46 




Number selected 








46 


1 


3 








49 


1 





sparse 


Number converged 


50 


50 


50 


50 


50 


50 


5 


49 


44 


38 




Number selected 








45 





5 


1 





47 


1 


1 



Table 2: Estimation accuracy under the true model* 







MISE(5) 


SD(ISE) 


MSPE(0) 


SD(SPE) 


a known 


moderate 


0.069 


0.072 


0.085 


0.095 




sparse 


0.072 


0.073 


0.085 


0.095 


a estimated 


moderate 


0.088 


0.079 


0.086 


0.095 




sparse 


0.146 


0.129 


0.087 


0.094 



* All numbers are multiplied by 100 



only report results under the true model in Tabled As can be seen from this table, when the initial 
conditions a are known, there is not much difference of the performance between the moderate 
case and the sparse case. On the other hand, when a are not known, the advantages of having 
more measurements become much more prominent. In Figure [H we have a visual comparison of 
the fits when the initial conditions a are known versus when they are estimated in the sparse 
case. In the moderate case, there is very little visual difference under these two settings. We plot 
the true g (solid green curve), the pointwise mean of ^ (broken red curve), and 2.5% and 97.5% 
pointwise quantiles (dotted blue curves) under the true model. These plots show that both fits 
are almost unbiased. Also, when a are estimated, there is greater variability in the estimated 
g at smaller values of x, partly due to scarcity of data in that region. Overall, as can be seen 
from these tables and figures, the proposed estimation and model selection procedures perform 
effectively. Moreover, with sufficient information, explicitly imposing nonnegativity in the model 
does not seem to be crucial: for the moderate and/or "a known" cases the resulting estimators of 
g are always nonnegative. 

7 Application: Plant Growth Data 

In this section, we apply the proposed method to the plant growth data from Sacks et al. (1997) 
described in the earlier Sections. The data consist of measurements on ten plants from a control 
group and nine plants from a treatment group where the plants are under water stress. The 
primary roots had grown for approximately 18 hours in the normal and stressed conditions before 
the measurements were taken. The roots were marked at different places using a water-soluble 
marker and high-resolution photographs were used to measure the displacements of the marked 
places. The measurements were in terms of distances from the root cap junction (in millimeters) 
and were taken for each of these marked places, hereafter markers, over an approximate 12-hour 
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Figure 4: True and fitted gradient functions for the sparse case. Left panel: a known; Right paneh 
a estimated. 



x(0) known x(0) estimated 




period while the plants were growing. Note that, measurements were only taken in the meristem. 
Thus whenever a marker moved outside of the meristem, its displacement would not be recorded at 
later times anymore. This, together with possible technical failures (in taking measurements), is the 
reason why in Figure [2] some growth trajectories were cut short. A similar, but more sophisticated, 
data acquisition technique is described in Walter et al. (2002), who study the diurnal pattern of 
root growth in maize. Van der Weele et al. (2003) describe a more advanced data acquisition 
technique for measuring the expansion profile of a growing root at a high spatial and temporal 
resolution. They also propose computational methods for estimating the growth velocity from 
this dense image data. Basu et al. (2007) develop a a new image-analysis technique to study 
spatio-temporal patterns of growth and curvature of roots that tracks the displacement of particles 
on the root over space and time. These methods, while providing plant scientists with valuable 
information, are limited in that, they do not provide an inferential framework and they require 
very dense measurements. Our method, even though designed to handle sparse data, is potentially 
applicable to these data as well. 

Consider the model described in Section [2j For the control group, we have the number of curves 
per subject Ni varying in between 10 and 29; and for the water stress group, we have 12 < A'^j < 
31. The observed growth displacement measurements {Ynj : j = 1, . . . , mn,! = 1, . . . , -/VijJLi 
assumed to follow model ([2]), where mu is the number of measurements taken for the i^^ plant 
at its marker, which varies between 2 and 17; and {tuj : j = I,-- - ,mii} are the times of 
measurements, which are in between [0, 12] hours. Altogether, for the control group there are 228 
curves with a total of 1486 measurements and for the treatment group there are 217 curves with 
1712 measurements in total. We are interested in comparing the baseline growth displacement rate 
between the treatment and control groups. 

As discussed earlier, there are natural constraints for the plant growth dynamics. Theoretically, 
g{0) = = ^'(0) and g'{x) = for x > A for some constant A > 0. For the former constraint, we 
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can simply omit the constant and linear terms in the spline basis. And for the latter constraint, in 
the objective function ([6]) we use 

I-2A I-2A 

(3^Bf3:=XR {g'{x)fdx = XR0^[ </.'(x)(,^'(x)f 
J A J A 

where 4> = {4>i^m, ■ ■ ■ , 4>m,mY' and \r is a large positive number quantifying the severity of this 
constraint; and A> Q determines where the growth displacement rate becomes a constant. A and 
\p> are both adaptively determined by the model selection scheme discussed in Section HI Moreover, 
as discussed earlier, since it is not appropriate to shrink the initial conditions {an} towards a fixed 
number, we set Ai = in the loss function 

We first describe a simple regression-based method for getting a crude initial estimate of the 
function g{-)^ as well as selecting a candidate set of knots. This involves (i) computing the re- 

scaled empirical derivatives e~^i X[^ - of the sample curves from the data, where the empirical 

derivatives are defined by taking divided differences: X'-^ - := (yj/^+i) — — tnj), and 

91 is a preliminary estimate of 6i] and (ii) regressing the re-scaled empirical derivatives onto a 
set of basis functions evaluated at the corresponding sample averages: Xuj := (yj/Q+i) + Yiij)/2. 
In this paper, we use the basis {x'^,x^,{x — Xk)%}^^i with a pre-specified, dense set of knots 
{xk}^^i- Then, a model selection procedure, like the stepwise regression, with either AIC or BIC 
criterion, can be used to select a set of candidate knots. In the following, we shall refer this method 
as stepwise-regression. The resulting estimate of g and the selected knots can then act as a 
starting point for the proposed procedure. We expect this simple method to work reasonably well 
only when the number of measurements per curve is at least moderately large. Comparisons given 
later (Figure [7|) demonstrate a clear superiority of the proposed method over this simple approach. 

Next, we fit the model to the control group and the treatment group separately. For the 
control group, we first fit models with g represented in cubic B-splines with equally spaced knot 
sequence 1 + 11.5(1 : M)/M for M = 2,3,4, •• • , 12. At this stage, we set /3™ = 1m, = 0^, 
a*™ = {Xii{tiii) : I = I, ■ ■ ■ , Ni)f^-^^. For Levenberg-Marquardt step, we fix Ai = and A2 = 
0.0025; and we update Ai,A2 adaptively in the Newton- Raphson step. The criterion based on 
the approximate CV score (jlSp selects the model with M = 9 basis functions (see Appendix D). 
This is not surprising since when equally spaced knots are used, usually a large number of basis 
functions are needed to fit the data adequately. In order to get a more parsimonious model, we 
consider the stepwise-regression method to obtain an initial estimate of g as well as finding 
a candidate set of knots. We use 28 equally spaced candidate knots on the interval [0.5, 14] and 
use the fitted values {9i}j2^^ from the previous fit. The AIC criterion selects 11 knots. We then 
consider various submodels with knots selected from this set of 11 knots and fit the corresponding 
models again using the procedure described in Section [3l Specifically, we first apply the Levenberg- 
Marquardt procedure with Ai, A2 fixed at Ai = and A2 = (a*"*)^/(ag"*)^ = 0.042, respectively, 
where a*"* and a^"* are obtained from the stepwise-regression fit. Then, after convergence of /3 
up to a desired precision (threshold of 0.005 for || /S"'"^ - /S"-""" ||), we apply the Newton-Raphson 
procedure with Ai fixed at zero, but A2 adaptively updated from the data. The approximate CV 
scores for various submodels are reported in Table [3l The parameters A and Xr are also varied and 
selected by the approximate CV score. Based on the approximate CV score, the model with knot 
sequence (3.0,4.0,6.0,9.0,9.5) and {A, Xr) = (9,10^) is selected. A similar procedure is applied 
to the treatment group. It turns out that the model with knot sequence (3.0, 3.5, 7.5) performs 
considerably better than other candidate models, and hence we only report the approximate CV 
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Table 3: Model selection for real data. Control group: approximate CV scores for four submod- 
els of the model selected by the AlC criterion in the stepwise-regression step. Ml: knots = 
(3.0,4.0,5.0,6.0,9.0,9.5); M2: knots = (3.0,4.0,5.5,6.0,9.0,9.5); M3: knots = (3.0,4.0,6.0,9.0,9.5); 
M4: knots = (3.0, 4.5, 6.0, 9.0, 9.5). Treatment group: approximate CV scores for the model M: knots 
= (3.0,3.5,7.5). 







A/? = 10^ 






Xr = 10^ 




Control 


Model 


A = 8.5 


A = 9 


A = 9.5 


A = 8.5 


A = 9 


A = 9.5 




Ml 


53.0924 


53.0877 


53.1299 


54.6422 


53.0803 


53.1307 




M2 


53.0942 


53.0898 


53.1374 


54.5190 


53.0835 


53.1375 




M3 


53.0300 


53.0355 


53.0729 


53.8769 


53.0063 


53.0729 




M4 


53.0420 


53.0409 


53.0723 


54.0538 


53.0198 


53.0722 


Treatment 


Model 


A = 7 


A = 7.5 


^ = 8 


A = 7 


A = 7.5 


^ = 8 


M 


64.9707 


64.9835 


64.9843 


65.5798* 


64.9817 


64.9817 



* no convergence 



scores under this model in Table [3] with various choices of {A, Xr). It can be seen that, (A, Xr) = 
(7, 10"^) has the smallest approximate CV score. 

Figure [5] shows the estimated gradient functions 'g under the selected models for the control 
and treatment groups, respectively. First of all, there is no growth bump observed for either group. 
This plot also indicates that different dynamics are at play for the two groups. In the part of 
the meristem closer to the root cap junction (distance within ~ 5.5mm), the growth displacement 
rate for the treatment group is higher than that for the control group. This is probably due to 
the greater cell elongation rate under water stress condition in this part of the meristem so that 
the root can reach deeper in the soil to get enough water. This is a known phenomenon in plant 
science. The growth displacement rate for the treatment group flattens out beyond a distance of 
about 6 mm from the root cap junction. The same phenomenon happens for the control group, 
however at a further distance of about 8 mm from the root cap junction. Also, the final constant 
growth displacement rate of the control group is higher than that of the treatment group. This is 
due to the stunting effect of water stress on these plants, which results in an earlier stop of growth 
and a slower cell division rate. Figure [6] shows the estimated relative elemental growth rates (i.e., 
g') for these two groups. Relative elemental growth rate (REGR) relates the magnitude of growth 
directly to the location along the meristem. For both groups, the growth is fastest in the middle 
part of the meristem (~ 3.8 mm for control group and ~ 3.1 for treatment group), and then growth 
dies down pretty sharply and eventually stops. Again, we observe a faster growth in the part of 
the meristem closer to the root cap junction for the water stress group and the growth dies down 
more quickly compared to the control group. The shape of the estimated g may suggest that it 
might be modeled by a logistic function with suitably chosen location and scale parameters, even 
though the scientific meaning of these parameters is unclear and the boundary constraints are not 
satisfied exactly. As discussed earlier, there is insufficient knowledge from plant science to suggest 
a functional form beforehand. This points to one major purpose of nonparametric modeling, which 
is to provide insight and to suggest candidate parametric models for further study. 

Figure [7] shows the residual versus time plot for the treatment group. The plot for the control 
group is similar and thus is omitted. This plot shows that the procedure based on minimizing the 
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Figure 5: Fitted gradient functions under the selected models for control and treatment (water- 
stress) groups, respectively. 



control 
water stress 



distance from root cap junction (in mm) 



objective function ([6]) has much smaller and more evenly spread residuals (SSE = 64.50) than the 
fit by stepwise-regression (SSE = 147.57), indicating a clear benefit of the more sophisticated 
approach. Overall, by considering the residual plots and CV scores, the estimation and model 
selection procedures give reasonable fits under both experimental conditions. Note that, for the 
first six hours, the residuals (right panel of Figure [7|) show some time-dependent pattern, which is 
not present for later times. Since throughout the whole 12 hour period, the residuals remain small 
compared to the scale of the measurements, the autonomous system approximation seems to be 
adequate for practical purposes. Modeling growth dynamics through nonautonomous systems may 
enable scientists to determine the stages of growth that are not steady across a region of the root. 
This is a topic of future research. 
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Figure 6: Fitted relative elemental growth rate (REGR) under the selected models for control and 
treatment groups, respectively. 



Figure 7: Residual versus time plots for the treatment group. Left panel: fit by 
stepwise-regression; Right panel: fit by the proposed method. 
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Appendix A : Reconstruction of Xii{-) and its derivatives 



In this section, we describe how to evaluate the (i,Z)-th sample trajectory Xii{-) and its deriva- 
tives given /3, 9i and an on a fine grid. For notational simplicity, we omit the dependence of the 
trajectories on the parameters {a,6,P), and drop the subscript M from (pk,M- 

Note that, Xii{-) satisfies the first order ODE 

d 

-Xa{t) = e'^Y.(^kMXii{t)), Xii{0) = aii, te[0,l]. (20) 

k=l 

Or equivalently 

Xu{t) = au+ / e^^\2PkMX^l{s))ds, t G [0, 1]. (21) 

We first describe a numerical procedure (4*^ order Runge-Kutta method) for constructing the 
sample trajectories Xii{t) and their derivatives (with respect to the parameters) on a pre-specified 
fine grid. 



Runge-Kutta method: the general procedure 

Suppose that a family of first order ODE is described in terms of the parameters generically denoted 
by ^ = (^yij^)) where r]i denotes the initial condition and r]2 can be vector- valued: 

^/(t) = G(t,/(t),7?2), /(0) = 7?i, tG[0,l]. (22) 

where G{t, x, r]2) is a smooth function. Denote the solution for this family of ODE as f{t, rj). Given 
the function G and the parameter 77, f{t,ri) can be solved numerically by an ODE solver. One of 
the commonly used approaches to solve such an initial value problem is the 4*'* order Runge-Kutta 
method. For a pre-specified small value h > 0, the 4*^ order Runge-Kutta method proceeds as 
follows: 

1. Initial step: define yo = rji and to = 0; 

2. Iterative step: in the m-|-l step (for < m < [1/h]), define ym+i = ym + ^{ki + 2k2 + 2ks + k4), 
and tm+i = tm + h, where 

ki = G{tm,ym,'n2) 
k2 = G {tjn + -,ym + ^ki,r]2 

h = G (^t„i + ^,ym + ^k2,'n2 
h = G{tra + h,ym + hk^,r]2). 

3. Final step: set fitrmV) = ym. for m = 0, • • • , 

Thus, at the end we obtain an evaluation (approximation) of /(•, J?) on the grid points {0, ft,, 2/i, • • • , }. 
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Note that f{t,r]) satisfies, 

f{t,rj) = r]i+ [ G{sJ{s,ri),r]2)ds, t > 0. 
Jo 



(23) 



Partially differentiating f{t,r]) with respect to rj and taking derivatives inside the integral, we 
obtain 



dm 



_d_ 



1 + 

ft 



/■* d 







_d_ 

dm 



f{s, v)Gf{s, f{s, rf),ri2) + /(s, ??), r^a) 



ds, 



(24) 
(25) 



where Gf and Gr^ denote the partial derivatives of G with respect to its second and third arguments, 
respectively. In equations (p^ and ([25]) . if we view the f{-,r]) inside Gf,Gr^ as known , ^^f{t,ri) 
is the solution of the first order ODE 

j^pit)=H{t,p{t),7]2), P(0) = 1, tG[0,l], 

where H{t,x,rj2) = xG f{t, f{t,ri),rj2). Similarly, -^f{t,ri) is the solution of the first order ODE 
with p(0) = and H{t,x,r]2) = xGf{t,f{t,r]),r]2) + Gr^{t, f{t,r]),ri2). Thus, given the function 
G and the parameter rj, a general strategy for numerically computing f{-,rj) and its gradient 
^/(•, 77) on a fine grid is to first use the Runge-Kutta method to approximate the solution to (p3]) . 
and then using that approximate solution in place oi f{-,r]) in equations (j24p and (j25p to compute 
the gradients by another application of the Runge-Kutta method. Note that, if we evaluate /(•, r]) 
on the grid points {0, h, 2h, • • • }, by the above procedure, we will obtain the gradients -§^f{-,v) 
a rougher grid: {0, 2h, Ah, • ■ ■}■ 

Derivatives of the sample paths {Xii{-)} with respect to {a, 6, (3) 

Differentiating ()20p with respect to the parameters, we have 



dXujt) 
dan 

dXgjt) 

dXujt) 
d/Sr 



1 + 



M 



k=l 



^^^e'^Y.M'kiXuis)) + e^' 5]/3,0fc(X,,(s)) 



de, 

dXujs 

dpr 



k=l 

M 



k=l 



ds 



k=l 



ds, 



(26) 
(27) 
(28) 



for i = 1, • • • ,n;l = l, 
equations: 



d_ 

di 



X^-it) 



dt 



, Ni;r = 1, ■ ■ ■ , M. In another word, these functions satisfy the differential 

AI 

E/^fc-^U^.l*)), ^7(0) = 1, (29) 

k=l 

M M 

Xfi {t)e'^ Yl M'k{Xii{t)) + e^' Yl Mk{Xu{t)), xfi (0) = 0, (30) 

k=l k=l 
M 

xi'at)e'^Y.^MXu{t)) + e'^MXiiit)), xf^(O) = 0. (31) 



k=l 
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Using similar arguments, it follows that the Hessian of Xii(-) with respect to (3, given by the matrix 



(^f"'^''')^r'=i' where X^^'''"'-' (t) := qs%b , Xii{t), satisfies the system of ODEs, for r, r' = 1, • • • , M: 



M 



X^-"^' it) {Xa it)) + Xf; m., [X.a [t)) + X^f' ^ {Xu (t)) 



k=l 
M 



xf^^{t)x^^^''{t)Y^Mi{Xu{t)) 



k=l 



, x^r'''io)^o. 



(32) 



The Hessian of Xii(-) with respect to 6i, given by X-^' \ satisfies the ODE 

M M 



d 
It 



Y^Mki^uit)) + ixy^it) + 2xii{t))Y,M'k{xu{t)) 



.k=l 



k=l 



M 



+ {xli{t)fY.M'L{Xu{t)) 



k=l 



Xyi(Q) = 0. 



(33) 



The Hessian of Xii{-) with respect to an, given by X^^^' satisfies the ODE 

M 



d_ 
di 



X7'"-(t) 



X^-''^-it)YPkcPi{Xu{t)) 



k=l 



M 



+ (x;;"(t))2^/3,<^','(x,(t)) 



k=l 



X^^«'"«(0) = 0. 



(34) 



Also, for future reference (even though it is not used in the proposed algorithm), we calculate the 

mixed partial derivative of Xii{-) with respect to 9i and Pr as xfi'^''{t) := ^q^.q^^ which satisfies 
the ODE 



d 



M 



fc=i 



M 



X^fit)YM'kiXait)) + MXiiit)) 



k=l 



M 



+ xl'itMXait)) + xl'it)X^fit)YPk€iXu{t)) 



k=l 



JSs:f-^'(0) = 0. (35) 



Thus, the approach described above shows that as long as we have evaluated (approximated) the 
function Xii{-) at the grid points {0 + mh/2 : ?n = 0, 1, . . . , 2/h}, we shall be able to approximate 
the gradients ^ai') and {X^f}rii at the grid points {0 + m/i : m = 0, 1, . . . , l/h}, and the 

Hessians X^'J"'"^", xfl'^' and (xf'''^'-')^^^,^^at the grid points {0 + 2mh : m = 0, 1, . . . , l/{2h)}, by 
successively applying the 4*^* order Runge-Kutta method. 



Expression when g is positive 

Note that (j29|) . (j30p and (j31|) are linear differential equations. For the growth model we have 
g positive and the initial conditions an also can be taken to be positive. If the function (7^ := 
'Yl!k=ih<Pk is also positive on the domain of {ajjj's, then the trajectories Xii(t) are nondecreasing 
in t (in fact strictly increasing if 5^ is strictly positive). In this case, and more generally, whenever 
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the solutions exist on a time interval [0, 1] and g/3 is twice continuously differentiable (so that 
the solution paths for Xf{' , xf/', X^f , X^)"'"^', and x'^'''^^' are functions on [0,1]) the 

gradients of the trajectories can be solved explicitly: 

X^iit) = e'Hg,3{Xiim (37) 
Xf^(i) = g^(Xu{t)) T'^'^ j^^dx. (38) 

In the following, we verify equation(|38p. The proofs for others are similar and thus omitted. We 
can express 

t 



= e'^ f UXuis)) exp ( /* ^-^^^^^X[i{u)du\ ds (using X[i{u) = e'^gp{Xa{u)) 

Jo \Js 9f3{Xil{u)) J 

= e^' [ MXii{s)) exp{log g^{Xu{t)) -log g^{Xu{s)))ds 
Jo 

= gf3{Xii{t)) / - — Xii{s)ds 

Jo {gf3{Xu{s))y 



4>rix) 



Using analogous calculations, we can obtain the Hessians in closed form as well. Thus, solutions 
to (IMI), dSl) and (IMI) become 

xT'^m = (^§fy)2[^^M^^^Ki))-5M^^Ko))]; (39) 

Xf-'^it) = e'^g,3iXumt + e'H'g'f,iXaitm (40) 

X^{'^^\t) 

= gp{Xu{t)) / — - [(t)',{x){Fr,{x) - Fr.{Xam) + Mx){Fr{x) - FriXam)] dx 

JxaiO) 9f3(x) 

+{FriXu{t)) - Fr{Xam){Fr'{Xu{t)) - Fr>{Xum)9^{Xii{t))g'^{Xii{t)) 

Xa(t) g' Jx) 

-e-'^g^{Xu{t)) I . ..3 [(l)r{x){Fr>{x) - Fr>{Xa{m + cl>r'{x){Fr{x) - F,(Xiz(0)))] dx 

«(o) {9i3[x)y 

(41) 



X. 



where, for xi < X2, 

d) (y) 

Fr{x2) - Fr{xi) = / ,,' dy, 1 < r < M. 

Jxi {9f3{y)r 
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We can express X^^'''^''' (t) alternatively as 



1 



+e''gf3{Xu{t)) 



9i3{Xii{s)) 

■t 



'x^{{s)ct>'AXu{s))+ct>'AXu{s))X^{'{s) 



dt 



/o 



(42) 



Similarly, we have the representation 
+e'^g^iXuit)) 



9/3iXu{s)) 
* 1 



9f3{Xii{s)) I 



Xll{smxa{s))ds 



'x^{{s)g'^{Xa{s)) + UXu{s)) + Xli{s)X^{ {s)g'l,{Xu{s)) 



ds. 
(43) 



Appendix B : Levenberg-Marquardt method 

The Levenberg-Marquardt method is a method for solving the nonlinear least squares problem: 

n 

min5(7) where S{^) = ^[y^ - /i(7)]^ 

where fii'jYs are nonlinear functions of the parameter 7 £ M^. The key idea is to linearly ap- 
proximate /i(7 -|- 5) fiil) + JT^^ foi' ^ small 6 G M^, where Jj is the Jacobian of fi at 7. 
Denote 

y = (yi, . . . , vnf, f(7) = (/i(7), • • • , fn{l)f, 
and J to be the n x p matrix with rows , . . . ,J^. The resulting linearized least squares problem 
involves, for given 7 solving for 6 the equation 

(J^J + A diag(j'^J))5 = J^(y - f (7)), (44) 

for a regularization parameter A > 0. Note that, this solution bears similarity with the ridge 
regression estimate. However, the formulation in (j44p is according to the observation by Marquardt 
that if each component of the gradient is scaled according to the curvature then there is a larger 
movement in the directions where the gradient is smaller. In practice, the regularization parameter 
A is chosen adaptively to facilitate convergence. 



Appendix C : Newton-Raphson procedure 



We briefly describe the key steps of the Newton-Raphson procedure for optimizing the objective 
function As in the implementation of the Levenberg-Marquardt algorithm, we break the 

iterative procedure in three steps. The update of a is still performed by the Levenberg-Marquardt 
algorithm Q, while keeping 6 and (3 fixed at the current values. However, we employ Newton- 
Raphson to update and f3. Fixing a, (3 at the current estimates a* and /3*, respectively, we 
update 9iS from the current estimates 9^ by 



■tna 



EE 

1=1 j=i 



-I -1 



92 



'ilj 



EE 

1=1 j=l 



(45) 
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where the quantities on the right hand side are all evaluated at {a* ,6* , (3*), and 



"2eiij-2^Xji(ty7) + 2{—Xii{tiji)f + 2- 



Similarly, the Newton-Raphson update for /3 is given by 



(3* 



n Ni mu 



^ '^aj _^ 2B 



^ ' n Ni mu 



EEEif + 2Br|, (46) 
i=i 1=1 j=i '-^ 



where the quantities on the right hand side are again evaluated at {a* ,6* , f3*), and 
diiij ~ 9 ~ 



9/3 "'9/3' 

''^ = 2—Xii{tiij) (^—Xii{tiij)^ -2eiij-^^^Xii(tiij). 



Appendix D : Cubic B-spline fits to the plant data 

We first consider the control group. In Table HI we report the results using B-spline basis with 
knots at 1 + 11.5(1 : M)/M for M = 2, 3, ••• , 12; and using cjf ^ = 0.05, and erf* = 1 as initial 
estimates. In the B-spline fitting, we set the penalty matrix B to be the zero matrix, that is 
Xji = 0. In the Newton-Raphson step, both Ai and A2 are estimated adaptively from the data. 
However, the Levenberg-Marquardt step is non-adaptive, that is it uses the initial values of Ai and 
A2 throughout. From Table HI for M= 2 to 8 there is no convergence. For M = 9 to 12, the 
approximate CV scores are quite similar and the minimum is achieved at M = 9. 

We then consider the fits for the treatment group. The results using B-splines with knots at 
1 + 9.5(1 : M)/M for M = 2, 3, • • • ,12; and using fjf * = 0.05, and erf ' = 1 are reported in Table 
[5j We again set Ai? = (that is no penalty). As for M = 2 to 6, there is no convergence. For M=7 
to 10, the CV scores are similar and the minimum is achieved again at M=9. For M = 11 and 12, 
the method breaks down due to numerical instability. 

Appendix E : Proof details 

In this section we provide the proofs of the key asymptotic results. 
Proof of Theorem 1 

For convenience, we introduce the following notations: 

Ni mi n 

4(0i,/3) :=5^^^ii,(ai,,0„/3) and £..(0,/3) := ^4(^i,/a). 

1=1 j=i i=i 
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Table 4: Approximate leave-one-curve-out CV scores for control group. Cubic B-spline basis with 
knot sequence 1 + 11.5(1 : M)/M; and cr*™ = 0.05, cr™* = 1. (* = no convergence) 



M 


# L-M 


#N-R 


CV score 


2* 


216 


1000 


489.01162 


3* 


445 


1000 


416.69848 


4* 


458 


1000 


91.21308 


5* 


546 


1000 


74.12581 


6* 


337 


1000 


58.25487 


7* 


279 


1000 


53.69243 


8* 


190 


1000 


53.37721 


9 


233 


195 


53.16987 


10 


147 


120 


53.26008 


11 


94 


79 


53.26125 


12 


78 


54 


53.41077 



Table 5: Approximate leave-one-curve-out CV scores for treatment group. Cubic B-spline basis 
with knot sequence 1 + 9.5(1 : M)/M; and a*"* = 0.05, CTg™ = 1. (* = no convergence) 



M 


# L-M 


#N-R 


CV score 


2* 


228 


1000 


348.65867 


3* 


426 


1000 


422.03137 


4* 


233 


1000 


96.66250 


5* 


257 


1000 


71.77904 


6* 


539 


1000 


65.85252 


7 


336 


277 


64.25370 


8 


197 


143 


63.91828 


9 


125 


83 


63.83346 


10 


94 


38 


63.90003 


11* 








12* 
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Here, := {62, ■ ■ ■ , On)'^ since 61 = 0. For a > 0, define 



, s.t. 



+ II s 



1}. (47) 



We use Xf^ - to denote Xii{Tij;aii, g) where Xii{-) is the solution of the equation x'{t) = e^^ g{x{t)) 
with x(0) = an- We use X-ii{-;9,l3) to denote the solution of (fT7|) when Xj;(0) = Oj;, and 
:= Xu{Tij;e,f3). We define 0,/3), X^f{--e,(3), Xy^{--e,P), X^-^^ {--e , (3) and 

X^f'^'''{-;G,l3) as the partial derivatives and mixed partial derivatives of Xii{-;6,(3) with respect 
to {6i,9i), {9i,(3r) and (/3r,/3r')) respectively. Notations such as X^ij{6,P) are used to mean 

X^i''(Tij;d,P). We use to denote the function J2k=i l^kfpk (for convenience henceforth dropping 
the subscript M from (j)k,M) and denote its first and second derivatives by g'j^ and (7^, respectively. 
Finally, we use || • ||oo to mean || • ||ioo(£)), and denote the operator norm of a matrix and I2 norm 
of a vector by || • ||. We use T to denote {Tij : j = 1, . . . ,mi]i = l,...,n} and e to denote 
{euj : j = 1,. . . ,mi;l = I, . . . ,Ni;i = 1,. . . ,n}. 



Let T] e M""^ and S S 
7 1 — / - - -1^ 



be arbitrary vectors satisfying || rj 



+ 



1. Define 



'n-l 



— and observe that YA=2i^i 

n Ni rrii 



6' Jn~iO. Define 



dX 

Wf, := Y,^Y.^Yii,-XiiiTij-aii,e*,P*))—^iTij-,aii,e*,P* 
i=i 1=1 j=i 



and We to be an (n — 1) x 1 vector with (i — l)-th coordinate 

Ni mi „ 

1=1 j=i ' 
for i = 2, . . . ,n. Also let W := (W^, Wj)'^ . Then by a second order Taylor expansion, we have, 

e„{e* + aN'n,(3* + unS) - e.Xo*,P*) 



We 



T Ti 



2 r si 





d' 









'6 







(48) 



where {6,j3) satisfies || /3 — /3* ||< qat and || 9 — 6* \\< un- Note that (0,(3) depends on (a,T) and 
(r/, (5), but not on e. In the above, Qpfi{6,(3) is the M x M matrix 



n Ni nii ^r.y- ^ 

i=l 1=1 j=l ^ ^ 



dXi, 



d(3 



Qgj3{6,P) is the (n — 1) x M matrix with (i — l)-th row 

'dXii 



Ni rrii 



(=1 j=i 



dp 



2,...,n; 
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Qg0{6,P) is the (n — 1) x (n — 1) diagonal matrix with the (i — l)-th diagonal entry 



"H /ay \ ■ 



i = 2, . . . ,n; 



g00{e,/3) = gep{e,(3Y; npp{e,f5) is the M x M matrix 



n Ni nii 



i=i 1=1 j=i 

TCgi3{6,P) is the (n — 1) x M matrix with (i — l)-th row 



Ni nii 



yZyZO^iij - Xii {Tij ; au ,9i,f3)) ^ {Tij ; an ,0i,f3), i = 2,...,n; 
TCgg{0,(3) is the (n — 1) x (n — 1) matrix with (i — l)-th diagonal entry 

Ni rrii 



EE*^^*'J' ~ XiiiTij; au^Oi, fi))-—^{Tij; aii,9i, P), i = 2,...,n; 
1=1 j=i 



and HfseiO, f3) = Hgi3{6, (3)^ . Let Q^^e, G^^/se, G*,ef3 and G*,i3i3 denote the expectations of Gee{^* .P*), 
G/3e{S* , P*), ^6/3(^*1/3*) and Gpp{0* , (3*) with respect to {a,T). For future reference, we define the 
(M + n — 1) X (Af + n — 1) symmetric matrix ^(0, /3) as 



Gi9,P) 



'Gppi9,(3) Gp9{e,(3) 
Ge(3iO,(3) G9e{e,p) 



We define 7i{d, f3) and G* analogously. 

The following decomposition of the residuals is used throughout: 

Yuj - Xuj{e,(3) = euj + {Xfi^ - Xuj{e\(3*)) + {Xu,{e*,(3*) - Xu,{e,(3)). 



(49) 



Without loss of generality in the following we assume that oatM'^/^ — > 0, so that in particular the 
bounds ()6ip - ()69p are valid. The proof of Theorem 1 then follows from the following sequence of 
lemmas. 

Lemma A.l : Let 7 = {6^ ^vj^)'^ , and W be as defined earlier. Then, with probability tending to 
1, uniformly in 7 such that \\ 7 ||= 1, we have 



\l'^W\ 



0{aeM^I'^^\o^{Nm)) + 0{M-P{Nm)^^^) ^ -f'^G{e* , (3*)-f . 



(50) 



Lemma A. 2 : With 7 as in Lemma A.l, uniformly over 7, we have 



.T 



7 



'Gpp{e_,^-G(3p{e\f3*) Gpe{e_,P) 
Gep{0,f3)-Gep{0*,f3*) Gee{0,P) 



Gpe{0*,(3*] 
Gee{0*,(3*) 



7 



0(a7vM3/2(iVM)i/2)^^Tg(6»*,/3*)7 + 0{a%M'^Nrn). 



(51) 
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Lemma A. 3 : There exists a constant C7 > such that, with 7 as in Lemma A.l, uniformly over 
7, we have 

7^g(6>*, /3*)7 > 7'^a*7(l - op{l)) > cjK-lNm{l - op{l)). (52) 



Lemma A. 4 : With 7 as in Lemma A.l, with probability tending to 1, uniformly over 7, 



T 

1 



rCepie, (3) - rLep{e*,(3*) nee{0, f3) - neeiO*,P*). 



7 



+0{aNM^'^-PNm) + 0{aeaNM^(Nmf/^^log(Nm)). (53) 

Finally, using (09]), ([60]), and ([65])- ([69]), we have 

max{|| nf3p{e*,p*) II, II ni3e{9*,p*) II, II nee{0*,P*) \\} 
= Op{aeM{Nmf/^) + 0{M-^P-^^Nm). (54) 

Combining ()5ip - (I54p . from (I48p . with probability tending to 1, uniformly in 7, 

£..(r + a^T7,/3* + a7v5) - L.{0\P*) 



UN (0(a,Afi/yiog(iVm)) + 0(M-P(A^m)i/2) + 0(a^M3/2(iVrn)i/2) j ^^Tg^Q*^^*^^ 



-aN\2 II r II 0{l) - alO[aNM^/^{Nrnf'^)^/i^Q^ 
-a%0{{aNM^/^ + a^M^ + q^M^/^-p ^ M-(f-i))iVm) 

-a^O((f7eaivM3 + M) {Nrnf'^ log (Nfn) ) 

> C4%/airiVm(l - op(l)) (55) 

where C4 > is some constant. The last step uses Lemma A. 3 and the following fact: 

[Q ] For any positive definite matrix A, with || A"^ \\< k, if 2c-y/K < 1, then for all x such that 
II X 11= 1 

1 



x^^x - cVx^Ax > -x^Ax 
- 2 



Thus, with probability tending to 1, there is a local minimum (0,(3) of the objective function ([5]) 

4- 



with II — 6* IP + II j3 — f3* |p< alf. This completes the proof of Theorem 1. 



Proof of Theorem 2 

We make use of the following inequality due to Halperin and Pitt (Mitrinovic, Pecaric and Fink, 
1991, page 8): // / is locally absolutely continuous and f" is in L2{[0,A]), then for any e > the 
following inequality holds 

A 

r/2 



/ f"<K{e) f + e r 
Jo Jo Jo 
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where K{e) = l/e + l2/A^. 

Define Xi{t,x) as the sample path Xnit; au^O* , 13*) when an = x. Since 6i = and (•;0,/3) 
is given by psp (Appendix A), in order to prove Theorem 2, it is enough to find a lower bound on 



mm 

l|b||=l 



gb{Xi{u,x))/gf3'. {Xi {u,x))du 



fT{t)dtdFa{x) 



where gh{u) = b-^<^(u). By A5, without loss of generality we can take the density /r(-) to be 
uniform on [0, 1]. Let 



Then, 



r{t,x) 
r'{t,x) 



R{t,x) := / gh{Xi{u,x))/g/3*{Xi{u,x))du. 

JQ 

gh{Xi{t,x)) 



d 



d_ 

dt 



r{t, x) 



g[,{Xi{t, x)) 9h{Xi{t, x))g'^,{Xi{t, x)) 
9UXi{t,x)) 



gf3'-{Xi{t,x)) 
x)) 9b{Xi{t, x))g'^, {Xiit, x)) 



X[{t,x) 



9(3* {Xi{t,x)) 



9l*{Xi{t,x)) 



9(3'{Xi{t,x)) 



From this, and the fact that the coordinates of <^'(n) are of the order 0(M^/^), coordinates of 

are of the order 0(M^/^), and all these functions are supported on intervals of length 0(M~^), we 

obtain that, uniformly in x, 

[ {r'{t,x)fdt = 0{M'^). (56) 

JO 

Application of Halperin-Pitt inequality with /(x) = R{t, x)^dt yields 

{r{t,x)fdtdFa{x) < (l/e + 12) j {R{t,x)fdtdFa{x) + e j {r' {t, x)fdtdFa{x). (57) 
Take e = koM~'^ for some /cq > 0, then by ([56]) . 



J {R{t,x)fdtdFa{x) > kiM-^ j 



{r{t,x)fdtdFa{x) - fcaM" 



for constants ki,k2 > dependent on /cq. Rewrite / f^{r(t,x))'^dtdFa{x) as 



dvdFa{x) = I g^{v)h{v)dv 



(58) 



where h{v) = g^*{v) J 1|^<„< Yi(i,x)}'^-^a(a;). If the knots are equally spaced on [xq + 5,xi — 6] for 
some constant (5 > is bounded below, then ini^^Dg h{v) is bounded below (even as M — > oo) 
where Dq is the union of the supports of {(^fc,M}fclii which contained in [xq + 5/2,xi — 5/2] for 
M sufficiently large). In this case, 

^^dv > ks for some constant k^ > 0. Thus, by 
appropriate choice of e, we have / {R{t , x))"^ dtdFa{x) > k^M""^ for some k^ > 0, which yields 

KM = 0{M^). 
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Proof of Proposition 1 



The proof is based on the foUowing lemmas. 

Lemma A. 5: Let Vd he the class of all polynomials p{x) = X]j'=o Pj^'' of degree d on [0, 1] such 
that \p\oo = 1- Then there exists a constant c > such that 

\p\oo > c max 

0<j<d ■' 



Lemma A. 6: Let ^ he a measure on the interval [0, 1] with the property that for any L > 0, there 
exists a constant C{L) > such that for any interval A C [0, 1], n{B)/ fj,{A) > C{L) for all intervals 
B G A with length{B) /length{A) > L. Then for any polynomial p of degree d on [0, 1], there exists 
a constant c > such that 

I p'^dfi > csup \p{u)\'^ fi{A) . 

For the next lemma, assume that the knots are ti = • • • = td+i = 0,tM+i = • • • = tM+d+i = 1 
and < < • • • < tM < 1- Note that we have placed extra knots at and 1 in order to obtain 
a B-spline basis. Let ip := : j = 1, M} be the (unnormalized) i?-spline basis with the knots 
{tj : j = d + 2, M}. Let (3 G M*^, and consider the spline s{x) := Yljii Pj'^ji^)- Then on the 
interval Ai := s{x) = J2i^d<j<i (^j^ji^) with J2,^d<j<i^jix) = 1- 

Lemma A. 7: Assume that ^ is a measure on [0, 1] satisfying the properties of Lemma A. 6 ahove. 
Consider the vector ip of B- splines on [0, 1] of degree d with well- conditioned knots at td+2^ ■■■,tM, 
i.e., the sequence {M(tj_|_i —ti) : i = d+1, M} remains hounded hetween two positive constants for 
any M. Then there exist constants ci2, C13 > (which do not depend on ...jtM) such that all the 
eigenvalues of the matrix J xj^tlj^dii are hetween C12 minrf+i<j<A/ and C13 maxd+i<i<A/ 

Lemma A. 8: Let h he a hounded nonnegative function on [0, 1] which is bounded away from zero 
except perhaps near and 1. Assume that lim^j-^o ^^^'^^(a;) and lim2;_^i(l — x)~"'h{x) are positive 
constants for some < 7 < 1. Let if) he a (unnormalized) B-spline hasis (as in Lemma A. 7). Then 
all the eigenvalues of J tptp'^hdx are hounded hetween ciqM'^'"' and ciiM~^ for some positive 
constants cio,cii > 0. 

Observe that under the stated condition on the density of Fa in the proposition, the function h{v) 
appearing in (I58p has the same behavior as stated in Lemma A. 8 (after a change of location and 
scale). Proposition 1 now follows from using Halperin-Pitt inequality as in (j57p . but now taking 



Rate bounds 

In this subsection, we summarize approximations of various quantities that are useful in proving 
Lemmas A.1-A.4. First, by A3 we have the following: 

II - 4' ll~= 0{aNM^+^/^) if II /3 - /3* ||< a^, j = 0, 1, 2. (59) 

Next, from A3 and A4, for M large enough, solutions {Xii{t]6,f5) : t G [0, 1]} exist for all {0,(5) 
such that max{|| 6 — 6* ||, || (3 — (3* \\} < ajy- This also implies that the solutions X'^i'{-;6, (3), 

X^f{-,6,p), X.l''^'{-,6,f3), xf/'^'-(-;6>,/3) and X^^'^^' {■■,6, (3) exist on [0,1] for all (6>,/3) such that 
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max{|| 6 — 6* \\,\\ f3 — (3* ||} < a^v, since the latter are linear differential equations where the 
coefficient functions depend on Xii{t;6, f3) (see Appendix A). Moreover, by GronwaU's lemma 
(Lemma F.l), ([59]) and the fact that || 5^? ||oo= 0(1) for j = 0,1,2 (again by A3), all these 
solutions are bounded for all 6i and an, by compactness of supp(Fa). 
Hence, if otyM^/^ ^ g^]^)^ ^i^^^ 

using Corollary F.2 (in Appendix F), the fact that || g^* ||oo — 
0(1) for j = 0, 1, 2, and the expressions for the ODEs for the partial and mixed partial derivatives 
(see Appendix A), after some algebra we obtain the following (almost surely): 



\\Xu{-,6*,f3*)-Xf^{-) \\^=0{M-P). 
The same technique can be used to prove the following (almost surely): 



(60) 

(61) 
(62) 
(63) 

(64) 

(65) 
(66) 

(67) 

(68) 

(69) 

whenever max{|| 6 — 6* ||, || /3 — /3* ||} < on- 

To illustrate the key arguments, we prove ([63]) and ([M]) . By ([38]) . and the fact that || (j)r 1 1 00 = 
0(M^/^) and is supported on an interval of length 0{M~^), (j63p follows; in fact it holds for all 
{6,(3) G r2(ajv) with Q.[a) as defined in (j47p . Next, note that the function (p^f. is Lipschitz with 
Lipschitz constant 0(M^/^) and is supported on an interval of length 0{M~^). Since (j3ip (in 
Appendix A) is a linear differential equation, using Corollary F.2 with 



6, (3) — Xii{ 


,6* 


f3*) 


1 00 


= 0{aNM^/^) 


\,X^{.-6,f3)-X^{ 


,6* 


(3*) 


1 00 


= 0{aNM^/'^) 


max X^''( 

l<r<M " 


,6* 


(3*) 


1 00 


= o{Nr'^i^) 


max \\X^;i-,6,p)- X^;( 

l<r<M " ^ ' '^^ ^ 


,6* 


13*) 


1 00 


= 0{aNM) 


\\xy^{--6,(3)-xy^{ 


,6* 


13*) 


1 00 


= 0{aNM^/^) 


max II X^/'^'-( 

l<r<A/ *' 


,6* 


(3*) 


1 00 


= 0(Mi/2) 


max \,X^^^^{.-6,(3)-Xy^{ 

l<r<M 


,6* 


(3*) 


1 00 


= 0{aNM^) 


max X-, ( 

l<r,r'<M " " 


,6* 


13*) 


1 00 


= 0{l) 


max \\x'^;''^-^'{--6,l3)- X'^;'^^'{ 


,6* 


(3*) 


1 00 


= 0{aNM^''^) 



6f{t,x) 



e'^g'f,{Xa{t;6,(3))-e'*g'^.{X.a{t;e*,(3*)) 



.iXu{t;6,P))-e'*MXiiit;6*,p*)) 



= {e''-e'^ [xg'f,{Xii{t;6,l3)) + MXiiit;6,p))] + xe'' {g'^{Xu{t-6 , (3)) - g'f,{Xa{t;6* , (3*))) 
+xe'Hg'^{Xu{t-,6* ,(3*))) - g'^,{Xu{t-,6* ,fi*))) + e'H<Pr{Xu{t-^^ 

we obtain ([MD by using <^ and the following facts: on [0, 1], \X'^{{t)\ = 0{M~^/'^) for all (0,/3) G 
II g'li ||oo= 0(a7vM5/2); II g'^-g'^. \\^= 0{aNM^I'')- and aNM^/^ = o(l). 

Proof of lemmas 

Proof of Lemma A.l : Using (01]), write 

n Ni mi 

Dih) := j^W = Y,Y.^ieuj+Auj)j^vuj, 

i=i 1=1 j=i 
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where Aj;j := Xf^ - — Xnj^O* , f3*), and vnj is the (M + n — 1) x 1 vector with the first M co- 
ordinates given by v(^^. := xfi{Tij; an, 9* , f3*) , and the last {n — 1) coordinates given by v^^j := 
X^J^iTi^j] an, 9* , f3*)ei^i, where ej is the z-th canonical basis vector in M'"~^, and eg := 0„_i. Notice 

=i^iij^Iij- Thus, by Cauchy-Schwarz inequahty, and the fact that 
maxi^lj \Aiij\ = 0{M~P) (by (j60p ) we have, uniformly in 7, 

n Ni mi 



I 5^ 5^ 5] A,,, 7^ V,,, I = 0{M-P{Nm)^/')^jTg(^e*,/3*)j. 
1=1 1=1 j=i 

Since enj are i.i.d. N{0,a^), we also have 

n Ni rrii 

i=i 1=1 j=i 

conditional on (a,T). Since the (conditional) Gaussian process 

1=1 1^1=1 Z^j=l ^■tiji ^■tij 



fil) 



is a smooth function over S^^+"~i (the unit sphere centered at in M*-'^"*'""^), and since by as- 
sumption M = 0{{Nm)'^) for some d > 0, using a covering of the sphere §*^+"-i by balls of radius 
ej\/ ^ {Nfri)~^ for an appropriately chosen D > 0, and using the fact that P(A^(0, 1) > t) < 
t~^{2TT)~^^'^ exp{—t'^ /2), for t > 0, we conclude that uniformly in 7, 



n Ni mi 

i=i 1=1 j=i 

except on a set with probability converging to zero. This completes the proof of the lemma. 

Proof of Lemma A. 2 : Define unj the same way as vuj is defined in the proof of Lemma A.l, 
with (61*, /3*) replaced by (e,P). Express L>2(7) := f^{g(e,p) - g{e*,f3*))f as 



T 

7 



7 



n Ni mi n Ni mi 

E E E - E E E ^^'i^^j ^ 
i=i 1=1 j=i 1=1 1=1 j=i 

n Ni mi 

1=1 1=1 j=l 



7- 



Then, by Cauchy-Schwarz inequality and (j62j) and (|Mj) . and the arguments used in the proof of 
Lemma A.l, 



|D2(7)I = 0iaNM^/\Nm)^/^)J-f^gie*,(3*)-f + 0ia%M^Nm). 



Proof of Lemma A. 3 : Define 1)3(7) := 7 {G{0*,l3*) - G^h- Then 

n Ni mi n Ni mi 

^3(7) = E E E + E E E ^*'^(^)' 

i=i 1=1 j=i i=i 1=1 j=i 
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where Uiij{-f) = Y {ViXujV,Xli.-E[{ViXiijViXli.)\aii]h and wujh) = Y {mV^XiljViXl^■)\ail 
iXiijV iXfi-)])^ , where, for notational simphcity, 



,n. 



Note that, the random variables Uiij{-f) have zero conditional mean (given an), are uniformly 
bounded and the variables Zij{'y) := Ylj^=i ^Ujil) ^'^s independent. Similarly, the random variables 
{wiii}i,j have zero mean are uniformly bounded and the variables 'Yl^=i''^i.ij{l) ^^^^ independent. 
Indeed, for each fixed (i, I), the variables {wuj^^l^ are identical since Tjj are i.i.d. Moreover, the col- 
lections {uiij{'y)} and {wiij{'y)} are differentiable functions of 7. Define Q^{a) := E(^*(0*, /3*)|a). 
Then, since Zij{-f) are uniformly bounded by KiN for some constant Ki > 0, and are independent 
given a, we have 



Var(5^5]Z,,(7)|a) = ^ J] £[(^,,(7)) V] 

i=l j=l 1=1 j=l 



n mi Ni 

i=i j=i 1=1 

n Ni rrii 

i=i 1=1 j=i 



In the above, second inequality uses i^iLiXi)'^ < X'^^^^^x'f, and the last follows from fact that 
Uiij{'j) is a difference of two nonnegative quantities, the second one being the conditional expec- 
tation of the first one given a. Thus, applying Bernstein's inequality, for every v > 0, for every 



TV 



n Ni rui 

^(lEEE^^'^-(^)l >^l«) ^2exp 
i=l 1=1 j=l 



2i^2iV7^^*(a)7 + KiNv/3 



Thus, using an entropy argument as in the proof of Lemma A.l, we conclude that given 6 > there 
exist positive constants Ci{6) and 6*2(5) such that on the set {a|7-^^^,(a)7 > C2(5)A^M log(A^m)}, 



sup 



EUEi=iET=i^u,{i)\ 



> Ci{6){NMlog{Nfn)y/^ \ a] < {Nm)-\ 



On the other hand, using an inversion formula for block matrices 



(70) 



(71) 
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where C^, := G*,i3f3 — Q*,f3eiQ*,ee) ^G*,ef3- The last equahty in ([7T]) is because A6 together with ([63 
imphes in particular that || G*.f3e{G*fie)~^ ||= 0(1). Now, from the facts that 



7 y*7 > 



Nm 

KM 



(by ([n])) and mm{N ,m} KMMlog{Nrn), 



for some constant > 0, so that 7-^^*7 ^ mM log{Nm), and using arguments similar to those 
leading to ([70|) we have, for some 03(6) > 0, 



sup 



Si=l Z]/=l Z]j=l ^i«i(7)l /.w_,,, /-jr7_NNi/2^ ^ /-jr7_N_5 
> C3(oj(?TiM log(iVmjj ' < {Nm) . 



(72) 



Now, observing that Yl,il\ Y^=\ "^Hj^l) = i Q ~ l Q*l, using fact [Q], and combining 

with (1701) and ()72p . we obtain that, there is a C4((5) > such that 

P ( sup > Ci{6){m^'^ + N^'\M\og{Nm)flA = 0{{Nm)"^). (73) 

(Note that, if the time points {tuj} were independently and identically distributed for different 
curves (z, /), then quantity (mV2+iV^''^)(M log(iVm))i/2 in ([73]) can be replaced by (M log(iVm))i/2). 
From (1731) it follows that 



7^g(r,r)7 > 7^^*7(1 - op(l)) > CQKl}Nm{l-op{l)) 
for some constant cg > and for sufficiently large A^. 

Proof of Lemma A. 4 : Using (142p and (I43p in Appendix A, for any t, we can express the 
(M + n - 1) X (M + n - 1) matrix with blocks Af'^^(t) := ((A^'''^'' (t)))*t,=i, ei„iA^^-^^(t), 
Xfi'^'{t)^ti and A{;-^'(t), as [/,;(*) + Ua{t)^ + FiK*) where 









'ct>'{X,i{s))- 


/o ff^(^i«('S)) 






0„_i 



and II Vii{t) 11= 0(1) uniformly in t, i and I. Note that, in the above description, all the sample 
paths and their derivatives are evaluated at (0*,/3*). 

Observe that, since Ynj — XuiTij; 6* , (3*) = suj + Auj, where Auj is as in the proof of Lemma 
A.l, we have 



nie,(3)-n{e*,p*) 

n Ni rrii 

i=i 1=1 j=i 

n Ni mi 

i=i 1=1 j=i 



■ X^f(6,W)_ X^f^)e[ 



ixti-\e,P)-X§f^i0*,(3*))ef_, 
e,^i{X^lf '(9,P)-X^if \e*,P*)) (A;;f (0,^3) - A^f (r,/3*))e,_ie^i 



(75) 
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First break the last summation in the last term of (j75|) into two parts - one corresponding to 
Aj/j's and the other corresponding to euj's. Then, using ([60]) . ([65]) . ([67|) and ([69]) . we conclude 
that the sum involving Auj is 0{ai\]M^/'^~^ NWi). The summation involving ejZj 's can be expressed 
as a linear function of e with coefficients that are functions of a, T and 7, and depend smoothly 
on 7. From this, conditionally on (a,T), this term is coordinatewise normally distributed with 
standard deviation 0((T£aArM^/^(A^m)^/^) for each fixed 7. We can conclude from this by an 
entropy argument (similar to the one used in the proof of Lemma A.l) that the supremum of this 
term over ah 7 G g^f+n-i -g 0{cjeaNM'^(NmY/'^ ■ 



'\og{Nm)) with probability tending to 1. 
Next, using ([71|) we express the first term of ([75]) as 



n Ni rrii 

i=i 1=1 j=i 

n Ni mi 

1=1 1=1 j=i 



ilj 



X^f {6*, (3*) 



ilj 



x^f'ie*,(r))eli 



(76) 



The second sum is 0{a%M^Nrn) by ([STD and ([MI)- Again, by the contribution 

in the first sum for the term involving Vii{tiijys is 0{aNM^^'^Nrn). Finally, by Cauchy-Schwarz 
inequality, and (j74p . and the facts that is bounded both above and below (for x > xq), we have 
for ah < t < 1, 



< c„ 




X 



ds 



.T 



7 



'cl^'{Xa{s)) 

On-l 



for some constant > depending on f?^*, 6 and xq. Notice that / (f)'{x){cf)'{x))'^dx < cjIvPIm 
for some C7 > 0. Furthermore, 

2 



sup sup 




4'(^)e.-i 



ds = 0(1). 



These, together with an application of Cauchy-Schwarz inequality, and using manipulations as 
in the proof of Lemmas A. 2 and A. 3, shows that the sum involving the terms Uii{tiij)'s in the 
expression (j76]) is 0{aNM^/'^{Nm)'^/'^)^/J^V7f■ Lemma A. 4 now follows. 

Proof of Lemma A. 5 : Suppose that the result is false. Then for any sequence of positive 
constant 5n decreasing to zero, we can find a polynomial pn with coefficients (3n.j, j = 0,...,d, 
such that \pn\oo < (^n maxo<j<rf Let qn be the polynomial whose coefficients {"fnj} are 

obtained by dividing f3j's by maxo<j<d |/3nj|- Note that max|7„j| = 1 for any n. By the usual 
compactness argument we can find a subsequence of {'jnj}, which we continue to denote by {'jnj}, 
such that 7„j — > 7j, j = 0, d, where maxj |7-,| = 1. However the supremum norm of the limiting 
polynomial q of degree d with coefficients 7^ is zero which implies that 7j = for all j. This leads 
to a contradiction. 

Proof of Lemma A. 6 : Note that for any interval A = [a,b], a < b, we can write f^p'^dfi = 
Jq q'^dfiA, where q{z) = p{a + (6 — a)z) and dfiA{z) = dfi{a + (6 — a)z). Since \q\oo = sup^gyj |p(^)l) 
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we may take \q\oo = 1- Let x* be a point in [0, 1] at which q{x*) equals ±1. Then for any < x < 1, 
q^{x) = 1 + {x — x*)^q{x**)q"{x**) for some x** in [0, 1]. Using Lemma A. 5, we see that there is a 
constant cu such that |p"|oo < C14 for all polynomials with |p|oo = 1- So \q'^{x) — 1| < ci4(x — x*)^. 
So we can find an interval / C [0, 1] of length at least L = (2ci4)~"^/^ containing x* such that 
> 1/2. Let B = a + {b- a)I. Then length{B) / length{A) = length{I) > L. Consequently, 

p'^dfj. = I q'^dfiA - ^ / '^AiA 



A Jo 



The result now follows. 



Proof of Lemma A. 7 : This result is clearly true for d = 0. We will prove it for the case d > 1. 
Let P eR^ and let s{x) = 0^iIj{x). Since s is a convex combination of /3j_d, on the interval 
Ai, we have 



d+l<i<M d+l<i<AI i~d<j<i 

This establishes the upper bound for the largest eigenvalue of J xj^xji^ . We will now establish the 
result on the lower bound of the smallest eigenvalue. 

Using property (viii) in chapter XI in de Boor (1978), we know that supj^^j<2.<^.^^^^ |s(x)| > 
cislftl for all i for some constant C15 > 0. Denote mo = mm.(i+i<i<M Hence for any 

d < i < M, by Lemma A. 6 we have 

ti+d+i rtj+i 

s dfi = y, / s dfi 

'+1 i+l<j<i+d+l *J 

i+l<j<i+d+l *j <^'<*j+i 

> c sup |s(x)p7no > ciePfmo. 

Incidentally, the same type of inequality holds for any i = 1, .., d—1. Consequently we have f s^d^ > 
C17 ^ PfruQ for some constant cn > 0. This completes the proof. 

Proof of Lemma A. 8: This follows from Lemma A. 7 once we take dfi{x) = h[x)dx. 



Appendix F : Perturbation of Differential Equations 

For nonparametric estimation of the gradient function 5, we need to control the effect of lack of 
fit to g (meaning that g may not be exactly represented in the given basis {</>*:(■)}) the sample 
paths {Xii{t) : t S [0, 1]}. It is convenient to do this study under a general setting of first order 
differential equations where the state variable x(-) is d-dimensional for d > 1. Our aim is to control 
the perturbation of the sample paths and its derivatives with respect to the parameters governing 
the differential equation when the true gradient function g is perturbed by an arbitrary function 

We present two different results about the perturbation of the solution paths of the initial value 
problem: 

x' = f{t,x), x{to) = xo, (77) 
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where x S M^, when the function / is perturbed by a smooth function. 

Theorem F.l (Deuflhard and Bornemann, 2002, p. 80) : On the augmented phase space 
let the mappings f and 6f be continuous and continuously dijjerentiable with respect to the state 
variable. Assume that for {tQ,xo) G 0,, the initial value problem (77^, and the perturbed problem 



x' = f{t,x)+Sf{t,x), x{to) = xo, (78) 

have the solutions x andx = x + 6x, respectively. Then for ti sufficiently close to to, there exists a 
continuous matrix-valued mapping M : A — > M'^^'^ on A = {(t,s) G : t G [to,ti],s G [to;^]} such 
that the perturbation 6x is represented by 

6x{t) = [ M{t,s)6f{s,x{s))ds, for all t e [to,ti]. (79) 

Jto 

Note that, the point ti can be chosen so that, the one-parameter family of initial value problems 

x' = f{t,x) + X-6f{t,x), x{to) = xo, (80) 

has a corresponding solution (/){■; X) G C-'^([to, ^i], IK"') for each parameter value A G [0,1]. In 
particular, (/>(•; 0) = x{-) and (p{-; 1) = x(-). 

Propagation matrix and its relationship to perturbation 

Let <!>*'*" denote the map such that x{t) = ^^'^"xq is the unique solution of the initial value problem 
(j77p . The following result (Theorem 3.1 in Deuflhard and Bornemann, 2002, p. 77) describes the 
dependence of the map on the gradient function /. 

Theorem F.2 : On the extended state space Q let f be continuous and p-times continuously 
differentiable, p > 1, with respect to the state variable. Moreover, suppose that for (to,a^o) £ ^ the 
unique solution of the initial value problem (f77| ) exists up to some time t > tQ. Then there is a 
neighborhood of the the state xq where for all s G [to,ii], the evolution 

X X 

is p-times continuously differentiable with respect to the state variable. In other words, the evolution 
inherits from the right side the smoothness properties with respect to the state variable. 

Then, the linearized perturbation of the state, due to a perturbation 6xs of the state at time s, 
namely, 

8x{t) PS ¥^'{x{s) + 8xs) - ¥^'x{s) 
is given by 5x{t) = W{t, s)5xs where 

W{t, s) = D^^'^'i |5=$Mo., G M'^^^'^ (81) 

is the Jacobi matrix. Note that, W{t, s) satisfies the differential equation: 

^W{t, s) = f^{t, ^'^'^xo)W{t, s), (82) 
at 

with initial condition W{s, s) = /. W{t, s) is called the propagation matrix belonging to x. 
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In general, we can express the matrix M(t, s) appearing in Theorem F.l as 

M{t,s) = [ W{t,s;X)dX, (83) 
Jo 

where W{t, s; A) is the propagation matrix belonging to (/>(•; A), and hence solves the homogeneous 
differential equation 

jW{t, s; A) = Ut, <p{t; X))Wit, s; A), Wis, s) = I. 
From this, the following corollary follows easily. 

Corollary F.l : // the limit 5f ^ is uniform in a neighborhood of the graph of the solution x, 
then the linearization 

5x{t)^ I W{t,s)5f{s,x{s))ds, for allt e[to,ti] 

Jto 

holds. 

Gronwall's Lemma and its implications 

Lemma F.l (Gronwall's Lemma): Let ip,x ^ C([^0) ^i]) I^) nonnegative functions and p>0. 
Then the integral inequality 

i^{t) <P+f x{s)i'{s)ds, for all t G [to, h] 

implies 

%l){t) < pexp x{s)^is)ds^ for all t G [to,ti]. 

In particular, ip = holds for p = 0. 

An immediate application of Lemma 1 is that, it gives a bound for || W{t, s; A) ||. Indeed, if 

II fxit; (Pit; A)) ||< x{t), for all A G [0, 1], (84) 

then taking ^p{t) =\\ W{t, s; A) || (note that ^p{•) depends on s) and p =\\ W{s, s; A) || = || I \\= 1, we 
obtain 

II W{t, s; A) ||< exp (^J^ x{u)du^ , for alHo < s < t < ti, for all A G [0, 1]. (85) 

Condition (|84p holds in particular if || fx{t, •) ||oo^ 9,nd then, from Theorem F.l, we obtain 
the important result. 

Corollary F.2 : If f is such that \\ fxit,-) \\oo^ xi^) for a function xi') bounded on [to,ti], and 
II ^fitr) \\oo^ Tit) for some nonnegative function r(-) on [to,ti], then 

rt \ 

xiu)du r(s)(is, for all t G [to>ii]- (86) 



Sxit) II < / exp r / 

J to \J s 



Note that, even though Mit,s) in (I79p in general depends on xq, the bound in (I86p does not. 
This has the implication that if one can prove the existence of solutions {4>i-; A) : A G [0, 1]} on an 
interval [toi^i] for an arbitrary collection of initial conditions xq, and the conditions of Corollary 
F.2 hold, then the same perturbation bound (j86p applies uniformly to each one of them. 
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